ATTACH/DETACH PARTITION CONCURRENTLY
Hi,
One of the downsides of declarative partitioning vs old school
inheritance partitioning is that a new partition cannot be added to
the partitioned table without taking an AccessExclusiveLock on the
partitioned table. We've obviously got a bunch of features for
various other things where we work a bit harder to get around that
problem, e.g. creating indexes concurrently.
I've started working on allowing partitions to be attached and
detached with just a ShareUpdateExclusiveLock on the table. If I'm
correct, then we can do this in a similar, but simpler, way to how
CREATE INDEX CONCURRENTLY works. We just need to pencil in that the
new partition exists but is not yet valid, then wait for snapshots
older than our own to finish before marking the partition as valid.
One problem I had with doing this is that there was not really a good
place to store that "isvalid" flag for partitions. We have pg_index
for indexes, but partition details are just spread over pg_inherits
and pg_class. So step 1 was to move all that into a new table called
pg_partition. I think this is quite nice as it also gets
relpartbound out of pg_class. It's probably just a matter of time
before someone complains that they can't create some partition with a
pretty large bound Datum because it doesn't fit on a single heap page
(pg_class has no TOAST table). I also ended up getting rid of
pg_class.relispartition, replacing it with relpartitionparent, which
is just InvalidOid when the table or index is not a partition. This
allows various pieces of code to be more efficient since we can look
at the relcache instead of scanning pg_inherits all the time. It's
now also much faster to get a partition's ancestors.
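To illustrate why a parent link helps, here's a toy C sketch (not the
patch's actual code; the struct, lookup table, and function names are
all stand-ins): with a relpartitionparent-style field cached per
relation, finding a partition's ancestors is just a walk up parent
pointers rather than a pg_inherits scan at every level.

```c
#include <assert.h>
#include <stddef.h>

#define InvalidOid 0
typedef unsigned int Oid;

/* toy stand-in for a relcache entry carrying the parent link */
typedef struct RelCacheEntry
{
	Oid			oid;
	Oid			relpartitionparent; /* InvalidOid when not a partition */
} RelCacheEntry;

/* toy "relcache" lookup: linear scan of a fixed table */
static RelCacheEntry *
lookup(RelCacheEntry *cache, int n, Oid oid)
{
	for (int i = 0; i < n; i++)
		if (cache[i].oid == oid)
			return &cache[i];
	return NULL;
}

/*
 * Collect ancestors bottom-up by following the parent links.
 * Returns the number of ancestors found.
 */
static int
get_partition_ancestors(RelCacheEntry *cache, int n, Oid oid,
						Oid *ancestors, int maxlen)
{
	int			count = 0;
	RelCacheEntry *e = lookup(cache, n, oid);

	while (e != NULL && e->relpartitionparent != InvalidOid &&
		   count < maxlen)
	{
		ancestors[count++] = e->relpartitionparent;
		e = lookup(cache, n, e->relpartitionparent);
	}
	return count;
}
```

Each step is a (cached) pointer chase instead of a catalog scan, which
is where the "much faster to get a partition's ancestors" claim comes
from.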
So, patch 0001 is just one I've already submitted for the July
'fest. Nothing new; it was just required to start this work.
0002 migrates partitions out of pg_inherits into pg_partition. This
patch is at a stage where it appears to work, but is very unpolished
and requires me to stare at it much longer than I've done so far.
There's a bunch of code that gets repeated way too many times in
tablecmds.c, for example.
0003 does the same for partitioned indexes. The patch is in a similar,
maybe slightly worse state than 0002. Various comments will be out of
date.
0004 is the early workings of what I have in mind for the concurrent
ATTACH code. It's vastly incomplete. It does pass make check, but
really only because there are no tests doing any concurrent attaches.
There's a mountain of code still missing to make things ignore
invalid partitions; I just have a very simple case working.
Partition-wise joins will be very much broken by what I have so far,
and likely a whole bunch of other stuff too.
The following is about the extent of my testing so far:
--setup
create table listp (a int) partition by list(a);
create table listp1 partition of listp for values in(1);
create table listp2 (a int);
insert into listp1 values(1);
insert into listp2 values(2);
-- example 1.
start transaction isolation level repeatable read; -- Session 1
select * from listp; -- Session 1
a
---
1
(1 row)
alter table listp attach partition concurrently listp2 for values in
(2); -- Session 2 (waits for release of session 1's snapshot)
select * from listp; -- Session 1
a
---
1
(1 row)
commit; -- session 1 (session 2's alter table now finishes waiting)
select * from listp; -- Session 1 (new partition now valid)
a
---
1
2
(2 rows)
-- example 2.
start transaction isolation level read committed; -- session 1
select * from listp; -- session 1
a
---
1
(1 row)
alter table listp attach partition concurrently listp2 for values in
(2); -- Session 2 completes without waiting.
select * from listp; -- Session 1 (new partition visible while in transaction)
a
---
1
2
(2 rows)
This basically works by:
1. Do all the normal ATTACH PARTITION validation.
2. Insert a record into pg_partition with partisvalid=false.
3. Obtain a session-level ShareUpdateExclusiveLock on the partitioned table.
4. Obtain a session-level AccessExclusiveLock on the partition being attached.
5. Commit.
6. Start a new transaction.
7. Wait for snapshots older than our own to be released.
8. Mark the partition as valid.
9. Invalidate the relcache for the partitioned table.
10. Release the session-level locks.
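The waiting condition in step 7 can be sketched in a few lines of C
(a toy illustration only; the real code would go through PostgreSQL's
procarray machinery, much like CREATE INDEX CONCURRENTLY does): a
snapshot taken before ours may have built a partition descriptor that
doesn't include the pending partition, so every such snapshot must be
gone before we flip partisvalid.

```c
#include <assert.h>

/* toy stand-in for whatever identifies a snapshot's age, e.g. an xmin */
typedef unsigned long long SnapId;

/*
 * Returns how many of the active snapshots were taken before ours,
 * i.e. how many we must still wait on before the new partition may
 * be marked valid.  When this reaches zero, no running transaction
 * can be using a partition descriptor that lacks the new partition.
 */
static int
snapshots_to_wait_for(SnapId ours, const SnapId *active, int n)
{
	int			count = 0;

	for (int i = 0; i < n; i++)
		if (active[i] < ours)
			count++;
	return count;
}
```

This is why example 1 above blocks (the repeatable read snapshot is
older than the ALTER's own) while example 2 completes immediately.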
I've disallowed the feature when the partitioned table has a default
partition. I don't see how this can be made to work, since attaching
a new partition also changes the default partition's implicit
constraint, which would mean verifying that the default partition
contains no rows belonging to the incoming partition.
At the moment ALTER TABLE ... ATTACH PARTITION commands cannot contain
any other sub-commands in the ALTER TABLE, so performing the
additional transaction commit and begin inside the single sub-command
might be okay. It does mean that the 'rel' which is passed down to
ATExecAttachPartition() must be closed and reopened again, which
leaves the calling function holding a pointer to a closed Relation.
I worked around this by changing the code so that it passes a pointer
to the Relation instead, and ATExecAttachPartition() updates that
pointer before returning. It's not particularly pretty, but I didn't
really see how else this could be done.
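A generic C sketch of that pointer-update workaround (all names here
are stand-ins, not the actual PostgreSQL functions): because the
callee must close and reopen the relation across the internal commit,
it takes a pointer to the caller's Relation pointer and refreshes it,
so the caller never holds a dangling reference.

```c
#include <assert.h>
#include <stdlib.h>

/* toy stand-ins for PostgreSQL's Relation machinery */
typedef struct RelationData
{
	unsigned int relid;
	int			is_open;
} RelationData;
typedef RelationData *Relation;

static Relation
relation_open(unsigned int relid)
{
	Relation	rel = malloc(sizeof(RelationData));

	rel->relid = relid;
	rel->is_open = 1;
	return rel;
}

static void
relation_close(Relation rel)
{
	free(rel);
}

/*
 * Takes Relation *, not Relation: the relation must be closed across
 * the internal commit, so we reopen it and update the caller's
 * pointer before returning.
 */
static void
attach_partition_concurrently(Relation *rel)
{
	unsigned int relid = (*rel)->relid;

	relation_close(*rel);		/* ... commit first transaction ... */
	/* ... start new transaction, wait for older snapshots ... */
	*rel = relation_open(relid);	/* caller sees the fresh Relation */
}
```

Had the function taken a plain Relation, the caller's copy would point
at freed memory after the close; passing the pointer by reference is
the least-bad fix the text describes.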
I've not yet done anything about the DETACH CONCURRENTLY case. I
think it should just be the same steps in roughly reverse order. We
can skip the waiting part if the partition being detached is still
marked as invalid from some failed concurrent ATTACH.
I've not thought much about pg_dump beyond just having it ignore
invalid partitions. I don't think it's very useful to support some
command that attaches an invalid partition, since there will be no
command to revalidate an invalid partition. It's probably best to
resolve that with a DETACH followed by a new ATTACH, so pg_dump can
probably just do nothing for invalid partitions.
So anyway, my intention in posting this patch now, rather than when
it's closer to being finished, is design review. I'm interested in
hearing objections, comments, and constructive criticism for patches
0002-0004. Patch 0001 comments can go to [1].
Are there any blockers on this that I've overlooked?
[1]: /messages/by-id/CAKJS1f81TpxZ8twugrWCo=VDHEkmagxRx7a+1z4aaMeQy=nA7w@mail.gmail.com
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
v1-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch
From 4fffd0df2226a5585930bd2f0e8b71019b174477 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Thu, 2 Aug 2018 15:55:55 +1200
Subject: [PATCH v1 1/4] Speed up INSERT and UPDATE on partitioned tables
This is more or less a complete redesign of PartitionTupleRouting. The
aim here is to get rid of all the possibly large arrays that were being
allocated during ExecSetupPartitionTupleRouting(). We now allocate
small arrays to store the partitions' ResultRelInfos and only enlarge
these when we run out of space. The partitions array is now ordered
by the order in which the partitions' ResultRelInfos are initialized
rather than in the same order as partdesc.
The slowest part of ExecSetupPartitionTupleRouting still remains. The
find_all_inheritors call still remains by far the slowest part of the
function. This patch just removes the other slow parts.
Initialization of the parent/child translation maps array is now only
performed when we need to store the first translation map. If the
column order between the parent and each of its children is the same,
then no map ever needs to be stored and this (possibly large) array
previously served no purpose.
For simple INSERTs hitting a single partition of a partitioned table
with many partitions, the shutdown of the executor was also slow in
comparison to the actual execution. This was down to the cleanup loop
for each ResultRelInfo having to walk over an array which often
contained mostly NULLs that had to be skipped. Performance of this
has improved now that the array we loop over no longer contains NULL
values to skip.
David Rowley and Amit Langote
---
src/backend/commands/copy.c | 31 +-
src/backend/executor/execPartition.c | 764 +++++++++++++++++++--------------
src/backend/executor/nodeModifyTable.c | 108 +----
src/backend/utils/cache/partcache.c | 11 +-
src/include/catalog/partition.h | 5 +-
src/include/executor/execPartition.h | 163 ++++---
6 files changed, 579 insertions(+), 503 deletions(-)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 9bc67ce60f..752ba3d767 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2699,10 +2699,8 @@ CopyFrom(CopyState cstate)
* will get us the ResultRelInfo and TupleConversionMap for the
* partition, respectively.
*/
- leaf_part_index = ExecFindPartition(target_resultRelInfo,
- proute->partition_dispatch_info,
- slot,
- estate);
+ leaf_part_index = ExecFindPartition(mtstate, target_resultRelInfo,
+ proute, slot, estate);
Assert(leaf_part_index >= 0 &&
leaf_part_index < proute->num_partitions);
@@ -2800,15 +2798,7 @@ CopyFrom(CopyState cstate)
* one.
*/
resultRelInfo = proute->partitions[leaf_part_index];
- if (unlikely(resultRelInfo == NULL))
- {
- resultRelInfo = ExecInitPartitionInfo(mtstate,
- target_resultRelInfo,
- proute, estate,
- leaf_part_index);
- proute->partitions[leaf_part_index] = resultRelInfo;
- Assert(resultRelInfo != NULL);
- }
+ Assert(resultRelInfo != NULL);
/* Determine which triggers exist on this partition */
has_before_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
@@ -2864,11 +2854,16 @@ CopyFrom(CopyState cstate)
* partition rowtype. Don't free the already stored tuple as it
* may still be required for a multi-insert batch.
*/
- tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[leaf_part_index],
- tuple,
- proute->partition_tuple_slot,
- &slot,
- false);
+ if (proute->parent_child_tupconv_maps)
+ {
+ TupleConversionMap *map =
+ proute->parent_child_tupconv_maps[leaf_part_index];
+
+ tuple = ConvertPartitionTupleSlot(map, tuple,
+ proute->partition_tuple_slot,
+ &slot,
+ false);
+ }
tuple->t_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
}
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index d13be4145f..7849e04bdb 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -31,11 +31,18 @@
#include "utils/rls.h"
#include "utils/ruleutils.h"
+#define PARTITION_ROUTING_INITSIZE 8
-static PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
- int *num_parted, List **leaf_part_oids);
-static void get_partition_dispatch_recurse(Relation rel, Relation parent,
- List **pds, List **leaf_part_oids);
+static void ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+ PartitionTupleRouting *proute);
+static void ExecExpandRoutingArrays(PartitionTupleRouting *proute);
+static int ExecInitPartitionInfo(ModifyTableState *mtstate,
+ ResultRelInfo *rootResultRelInfo,
+ PartitionTupleRouting *proute,
+ EState *estate,
+ PartitionDispatch parent, int partidx);
+static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
+ Oid partoid, PartitionDispatch parent_pd, int partidx);
static void FormPartitionKeyDatum(PartitionDispatch pd,
TupleTableSlot *slot,
EState *estate,
@@ -62,138 +69,115 @@ static void find_matching_subplans_recurse(PartitionPruningData *prunedata,
* Note that all the relations in the partition tree are locked using the
* RowExclusiveLock mode upon return from this function.
*
- * While we allocate the arrays of pointers of ResultRelInfo and
- * TupleConversionMap for all partitions here, actual objects themselves are
- * lazily allocated for a given partition if a tuple is actually routed to it;
- * see ExecInitPartitionInfo. However, if the function is invoked for update
- * tuple routing, caller would already have initialized ResultRelInfo's for
- * some of the partitions, which are reused and assigned to their respective
- * slot in the aforementioned array. For such partitions, we delay setting
- * up objects such as TupleConversionMap until those are actually chosen as
- * the partitions to route tuples to. See ExecPrepareTupleRouting.
+ * Callers must use the returned PartitionTupleRouting during calls to
+ * ExecFindPartition. The actual ResultRelInfos are allocated lazily by that
+ * function.
*/
PartitionTupleRouting *
ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
{
- List *leaf_parts;
- ListCell *cell;
- int i;
- ResultRelInfo *update_rri = NULL;
- int num_update_rri = 0,
- update_rri_index = 0;
PartitionTupleRouting *proute;
- int nparts;
ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
- /*
- * Get the information about the partition tree after locking all the
- * partitions.
- */
+ /* Lock all the partitions. */
(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
- proute = (PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
- proute->partition_dispatch_info =
- RelationGetPartitionDispatchInfo(rel, &proute->num_dispatch,
- &leaf_parts);
- proute->num_partitions = nparts = list_length(leaf_parts);
- proute->partitions =
- (ResultRelInfo **) palloc(nparts * sizeof(ResultRelInfo *));
- proute->parent_child_tupconv_maps =
- (TupleConversionMap **) palloc0(nparts * sizeof(TupleConversionMap *));
- proute->partition_oids = (Oid *) palloc(nparts * sizeof(Oid));
-
- /* Set up details specific to the type of tuple routing we are doing. */
- if (node && node->operation == CMD_UPDATE)
- {
- update_rri = mtstate->resultRelInfo;
- num_update_rri = list_length(node->plans);
- proute->subplan_partition_offsets =
- palloc(num_update_rri * sizeof(int));
- proute->num_subplan_partition_offsets = num_update_rri;
- /*
- * We need an additional tuple slot for storing transient tuples that
- * are converted to the root table descriptor.
- */
- proute->root_tuple_slot = MakeTupleTableSlot(NULL);
- }
+ /*
+ * Here we attempt to expend as little effort as possible in setting up
+ * the PartitionTupleRouting. Each partition's ResultRelInfo is built
+ * lazily, only when we actually need to route a tuple to that partition.
+ * The reason for this is that a common case is for INSERT to insert a
+ * single tuple into a partitioned table and this must be fast.
+ *
+ * We initially allocate enough memory to hold PARTITION_ROUTING_INITSIZE
+ * PartitionDispatch and ResultRelInfo pointers in their respective
+ * arrays. More space can be allocated later, if required via
+ * ExecExpandRoutingArrays.
+ *
+ * The PartitionDispatch for the target partitioned table of the command
+ * must be set up, but any sub-partitioned tables can be set up lazily as
+ * and when the tuples get routed to (through) them.
+ */
+ proute = (PartitionTupleRouting *) palloc(sizeof(PartitionTupleRouting));
+ proute->partition_root = rel;
+ proute->partition_dispatch_info = (PartitionDispatchData **)
+ palloc(sizeof(PartitionDispatchData) * PARTITION_ROUTING_INITSIZE);
+ proute->num_dispatch = 0;
+ proute->dispatch_allocsize = PARTITION_ROUTING_INITSIZE;
+
+ proute->partitions = (ResultRelInfo **)
+ palloc(sizeof(ResultRelInfo *) * PARTITION_ROUTING_INITSIZE);
+ proute->num_partitions = 0;
+ proute->partitions_allocsize = PARTITION_ROUTING_INITSIZE;
+
+ /* We only allocate these arrays when we need to store the first map */
+ proute->parent_child_tupconv_maps = NULL;
+ proute->child_parent_tupconv_maps = NULL;
+ proute->child_parent_map_not_required = NULL;
/*
- * Initialize an empty slot that will be used to manipulate tuples of any
- * given partition's rowtype. It is attached to the caller-specified node
- * (such as ModifyTableState) and released when the node finishes
- * processing.
+ * Initialize this table's PartitionDispatch object. Here we pass in the
+ * parent as NULL as we don't need to care about any parent of the target
+ * partitioned table.
*/
- proute->partition_tuple_slot = MakeTupleTableSlot(NULL);
+ (void) ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL,
+ 0);
- i = 0;
- foreach(cell, leaf_parts)
+ /*
+ * If performing an UPDATE with tuple routing, we can reuse partition
+ * sub-plan result rels. We build a hash table to map the OIDs of
+ * partitions present in mtstate->resultRelInfo to their ResultRelInfos.
+ * Every time a tuple is routed to a partition that we've yet to set the
+ * ResultRelInfo for, before we go making one, we check for a pre-made one
+ * in the hash table.
+ *
+ * Also, we'll need a slot that will transiently store the tuple being
+ * routed using the root parent's rowtype.
+ */
+ if (node && node->operation == CMD_UPDATE)
{
- ResultRelInfo *leaf_part_rri = NULL;
- Oid leaf_oid = lfirst_oid(cell);
-
- proute->partition_oids[i] = leaf_oid;
-
- /*
- * If the leaf partition is already present in the per-subplan result
- * rels, we re-use that rather than initialize a new result rel. The
- * per-subplan resultrels and the resultrels of the leaf partitions
- * are both in the same canonical order. So while going through the
- * leaf partition oids, we need to keep track of the next per-subplan
- * result rel to be looked for in the leaf partition resultrels.
- */
- if (update_rri_index < num_update_rri &&
- RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
- {
- leaf_part_rri = &update_rri[update_rri_index];
-
- /*
- * This is required in order to convert the partition's tuple to
- * be compatible with the root partitioned table's tuple
- * descriptor. When generating the per-subplan result rels, this
- * was not set.
- */
- leaf_part_rri->ri_PartitionRoot = rel;
-
- /* Remember the subplan offset for this ResultRelInfo */
- proute->subplan_partition_offsets[update_rri_index] = i;
-
- update_rri_index++;
- }
-
- proute->partitions[i] = leaf_part_rri;
- i++;
+ ExecHashSubPlanResultRelsByOid(mtstate, proute);
+ proute->root_tuple_slot = MakeTupleTableSlot(NULL);
+ }
+ else
+ {
+ proute->subplan_resultrel_hash = NULL;
+ proute->root_tuple_slot = NULL;
}
/*
- * For UPDATE, we should have found all the per-subplan resultrels in the
- * leaf partitions. (If this is an INSERT, both values will be zero.)
+ * Initialize an empty slot that will be used to manipulate tuples of any
+ * given partition's rowtype.
*/
- Assert(update_rri_index == num_update_rri);
+ proute->partition_tuple_slot = MakeTupleTableSlot(NULL);
return proute;
}
/*
- * ExecFindPartition -- Find a leaf partition in the partition tree rooted
- * at parent, for the heap tuple contained in *slot
+ * ExecFindPartition -- Find a leaf partition for the tuple contained in *slot.
+ * If the partition's ResultRelInfo does not yet exist in 'proute' then we set
+ * one up or reuse one from mtstate's resultRelInfo array.
*
* estate must be non-NULL; we'll need it to compute any expressions in the
* partition key(s)
*
* If no leaf partition is found, this routine errors out with the appropriate
- * error message, else it returns the leaf partition sequence number
- * as an index into the array of (ResultRelInfos of) all leaf partitions in
- * the partition tree.
+ * error message, else it returns the index of the leaf partition's
+ * ResultRelInfo in the proute->partitions array.
*/
int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
+ExecFindPartition(ModifyTableState *mtstate,
+ ResultRelInfo *resultRelInfo,
+ PartitionTupleRouting *proute,
TupleTableSlot *slot, EState *estate)
{
- int result;
+ PartitionDispatch *pd = proute->partition_dispatch_info;
Datum values[PARTITION_MAX_KEYS];
bool isnull[PARTITION_MAX_KEYS];
Relation rel;
PartitionDispatch dispatch;
+ PartitionDesc partdesc;
ExprContext *ecxt = GetPerTupleExprContext(estate);
TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
TupleTableSlot *myslot = NULL;
@@ -216,9 +200,10 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
while (true)
{
TupleConversionMap *map = dispatch->tupmap;
- int cur_index = -1;
+ int partidx = -1;
rel = dispatch->reldesc;
+ partdesc = dispatch->partdesc;
/*
* Convert the tuple to this parent's layout, if different from the
@@ -244,37 +229,114 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
FormPartitionKeyDatum(dispatch, slot, estate, values, isnull);
/*
- * Nothing for get_partition_for_tuple() to do if there are no
- * partitions to begin with.
+ * If this partitioned table has no partitions or no partition for
+ * these values, then error out.
*/
- if (dispatch->partdesc->nparts == 0)
+ if (partdesc->nparts == 0 ||
+ (partidx = get_partition_for_tuple(dispatch, values, isnull)) < 0)
{
- result = -1;
- break;
+ char *val_desc;
+
+ val_desc = ExecBuildSlotPartitionKeyDescription(rel,
+ values, isnull, 64);
+ Assert(OidIsValid(RelationGetRelid(rel)));
+ ereport(ERROR,
+ (errcode(ERRCODE_CHECK_VIOLATION),
+ errmsg("no partition of relation \"%s\" found for row",
+ RelationGetRelationName(rel)),
+ val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
}
- cur_index = get_partition_for_tuple(dispatch, values, isnull);
-
- /*
- * cur_index < 0 means we failed to find a partition of this parent.
- * cur_index >= 0 means we either found the leaf partition, or the
- * next parent to find a partition of.
- */
- if (cur_index < 0)
+ if (partdesc->is_leaf[partidx])
{
- result = -1;
- break;
- }
- else if (dispatch->indexes[cur_index] >= 0)
- {
- result = dispatch->indexes[cur_index];
- /* success! */
- break;
+ int result = -1;
+
+ /*
+ * Get this leaf partition's index in the
+ * PartitionTupleRouting->partitions array. We may require
+ * building a new ResultRelInfo.
+ */
+ if (likely(dispatch->indexes[partidx] >= 0))
+ {
+ /* ResultRelInfo already built */
+ Assert(dispatch->indexes[partidx] < proute->num_partitions);
+ result = dispatch->indexes[partidx];
+ }
+ else
+ {
+ /*
+ * A ResultRelInfo has not been set up for this partition yet,
+ * so either use one of the sub-plan result rels or create a
+ * fresh one.
+ */
+ if (proute->subplan_resultrel_hash)
+ {
+ ResultRelInfo *rri;
+ Oid partoid = partdesc->oids[partidx];
+
+ rri = hash_search(proute->subplan_resultrel_hash,
+ &partoid, HASH_FIND, NULL);
+
+ if (rri)
+ {
+ result = proute->num_partitions++;
+ dispatch->indexes[partidx] = result;
+
+
+ /* Allocate more space in the arrays, if required */
+ if (result >= proute->partitions_allocsize)
+ ExecExpandRoutingArrays(proute);
+
+ /* Save here for later use. */
+ proute->partitions[result] = rri;
+ }
+ }
+
+ /* We need to create one afresh. */
+ if (result < 0)
+ {
+ MemoryContextSwitchTo(oldcxt);
+ result = ExecInitPartitionInfo(mtstate, resultRelInfo,
+ proute, estate,
+ dispatch, partidx);
+ MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+ Assert(result >= 0 && result < proute->num_partitions);
+ }
+ }
+
+ /* Release the tuple in the lowest parent's dedicated slot. */
+ if (slot == myslot)
+ ExecClearTuple(myslot);
+
+ MemoryContextSwitchTo(oldcxt);
+ ecxt->ecxt_scantuple = ecxt_scantuple_old;
+ return result;
}
else
{
- /* move down one level */
- dispatch = pd[-dispatch->indexes[cur_index]];
+ /*
+ * Partition is a sub-partitioned table; get the PartitionDispatch
+ */
+ if (likely(dispatch->indexes[partidx] >= 0))
+ {
+ /* Already built. */
+ Assert(dispatch->indexes[partidx] < proute->num_dispatch);
+ dispatch = pd[dispatch->indexes[partidx]];
+ }
+ else
+ {
+ /* Not yet built. Do that now. */
+ PartitionDispatch subdispatch;
+
+ MemoryContextSwitchTo(oldcxt);
+ subdispatch = ExecInitPartitionDispatchInfo(proute,
+ partdesc->oids[partidx],
+ dispatch, partidx);
+ MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+ Assert(dispatch->indexes[partidx] >= 0 &&
+ dispatch->indexes[partidx] < proute->num_dispatch);
+ dispatch = subdispatch;
+ }
/*
* Release the dedicated slot, if it was used. Create a copy of
@@ -287,58 +349,131 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
}
}
}
+}
+
+/*
+ * ExecHashSubPlanResultRelsByOid
+ * Build a hash table to allow fast lookups of subplan ResultRelInfos by
+ * partition Oid. We also populate the subplan ResultRelInfo with an
+ * ri_PartitionRoot.
+ */
+static void
+ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+ PartitionTupleRouting *proute)
+{
+ ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+ ResultRelInfo *subplan_result_rels;
+ HASHCTL ctl;
+ HTAB *htab;
+ int nsubplans;
+ int i;
+
+ subplan_result_rels = mtstate->resultRelInfo;
+ nsubplans = list_length(node->plans);
- /* Release the tuple in the lowest parent's dedicated slot. */
- if (slot == myslot)
- ExecClearTuple(myslot);
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(ResultRelInfo **);
+ ctl.hcxt = CurrentMemoryContext;
- /* A partition was not found. */
- if (result < 0)
+ htab = hash_create("PartitionTupleRouting table", nsubplans, &ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ proute->subplan_resultrel_hash = htab;
+
+ /* Hash all subplans by their Oid */
+ for (i = 0; i < nsubplans; i++)
{
- char *val_desc;
-
- val_desc = ExecBuildSlotPartitionKeyDescription(rel,
- values, isnull, 64);
- Assert(OidIsValid(RelationGetRelid(rel)));
- ereport(ERROR,
- (errcode(ERRCODE_CHECK_VIOLATION),
- errmsg("no partition of relation \"%s\" found for row",
- RelationGetRelationName(rel)),
- val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
+ ResultRelInfo *rri = &subplan_result_rels[i];
+ bool found;
+ Oid partoid = RelationGetRelid(rri->ri_RelationDesc);
+ ResultRelInfo **subplanrri;
+
+ subplanrri = (ResultRelInfo **) hash_search(htab, &partoid, HASH_ENTER,
+ &found);
+
+ if (!found)
+ *subplanrri = rri;
+
+ /*
+ * This is required in order to convert the partition's tuple to be
+ * compatible with the root partitioned table's tuple descriptor. When
+ * generating the per-subplan result rels, this was not set.
+ */
+ rri->ri_PartitionRoot = proute->partition_root;
}
+}
- MemoryContextSwitchTo(oldcxt);
- ecxt->ecxt_scantuple = ecxt_scantuple_old;
+/*
+ * ExecExpandRoutingArrays
+ * Double the size of the allocated arrays in 'proute'
+ */
+static void
+ExecExpandRoutingArrays(PartitionTupleRouting *proute)
+{
+ int new_size = proute->partitions_allocsize * 2;
+ int old_size = proute->partitions_allocsize;
- return result;
+ proute->partitions_allocsize = new_size;
+
+ proute->partitions = (ResultRelInfo **)
+ repalloc(proute->partitions, sizeof(ResultRelInfo *) * new_size);
+
+ if (proute->parent_child_tupconv_maps != NULL)
+ {
+ proute->parent_child_tupconv_maps = (TupleConversionMap **)
+ repalloc(proute->parent_child_tupconv_maps,
+ sizeof(TupleConversionMap *) * new_size);
+ memset(&proute->parent_child_tupconv_maps[old_size], 0,
+ sizeof(TupleConversionMap *) * (new_size - old_size));
+ }
+
+ if (proute->child_parent_map_not_required != NULL)
+ {
+ proute->child_parent_tupconv_maps = (TupleConversionMap **)
+ repalloc(proute->child_parent_tupconv_maps,
+ sizeof(TupleConversionMap *) * new_size);
+ memset(&proute->child_parent_tupconv_maps[old_size], 0,
+ sizeof(TupleConversionMap *) * (new_size - old_size));
+ }
+
+ if (proute->child_parent_map_not_required != NULL)
+ {
+ proute->child_parent_map_not_required = (bool *)
+ repalloc(proute->child_parent_map_not_required,
+ sizeof(bool) * new_size);
+ memset(&proute->child_parent_map_not_required[old_size], 0,
+ sizeof(bool) * (new_size - old_size));
+ }
}
/*
* ExecInitPartitionInfo
* Initialize ResultRelInfo and other information for a partition
- *
- * Returns the ResultRelInfo
+ * and store it in the next empty slot in 'proute's partitions array and
+ * return the index of that element.
*/
-ResultRelInfo *
+static int
ExecInitPartitionInfo(ModifyTableState *mtstate,
- ResultRelInfo *resultRelInfo,
+ ResultRelInfo *rootResultRelInfo,
PartitionTupleRouting *proute,
- EState *estate, int partidx)
+ EState *estate,
+ PartitionDispatch dispatch, int partidx)
{
ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
- Relation rootrel = resultRelInfo->ri_RelationDesc,
+ Relation rootrel = rootResultRelInfo->ri_RelationDesc,
partrel;
Relation firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
ResultRelInfo *leaf_part_rri;
MemoryContext oldContext;
AttrNumber *part_attnos = NULL;
bool found_whole_row;
+ int part_result_rel_index;
/*
* We locked all the partitions in ExecSetupPartitionTupleRouting
* including the leaf partitions.
*/
- partrel = heap_open(proute->partition_oids[partidx], NoLock);
+ partrel = heap_open(dispatch->partdesc->oids[partidx], NoLock);
/*
* Keep ResultRelInfo and other information for this partition in the
@@ -514,15 +649,25 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
&mtstate->ps, RelationGetDescr(partrel));
}
+ part_result_rel_index = proute->num_partitions++;
+ dispatch->indexes[partidx] = part_result_rel_index;
+
+ /* Allocate more space in the arrays, if required */
+ if (part_result_rel_index >= proute->partitions_allocsize)
+ ExecExpandRoutingArrays(proute);
+
+ /* Save here for later use. */
+ proute->partitions[part_result_rel_index] = leaf_part_rri;
+
/* Set up information needed for routing tuples to the partition. */
- ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri, partidx);
+ ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri,
+ part_result_rel_index);
/*
* If there is an ON CONFLICT clause, initialize state for it.
*/
if (node && node->onConflictAction != ONCONFLICT_NONE)
{
- TupleConversionMap *map = proute->parent_child_tupconv_maps[partidx];
int firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
TupleDesc partrelDesc = RelationGetDescr(partrel);
ExprContext *econtext = mtstate->ps.ps_ExprContext;
@@ -535,7 +680,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
* list and searching for ancestry relationships to each index in the
* ancestor table.
*/
- if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) > 0)
+ if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) > 0)
{
List *childIdxs;
@@ -548,7 +693,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
ListCell *lc2;
ancestors = get_partition_ancestors(childIdx);
- foreach(lc2, resultRelInfo->ri_onConflictArbiterIndexes)
+ foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
{
if (list_member_oid(ancestors, lfirst_oid(lc2)))
arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
@@ -562,7 +707,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
* (This shouldn't happen, since arbiter index selection should not
* pick up an invalid index.)
*/
- if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) !=
+ if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
list_length(arbiterIndexes))
elog(ERROR, "invalid arbiter index list");
leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
@@ -572,8 +717,14 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
*/
if (node->onConflictAction == ONCONFLICT_UPDATE)
{
+ TupleConversionMap *map;
+
+ map = proute->parent_child_tupconv_maps ?
+ proute->parent_child_tupconv_maps[part_result_rel_index] :
+ NULL;
+
Assert(node->onConflictSet != NIL);
- Assert(resultRelInfo->ri_onConflict != NULL);
+ Assert(rootResultRelInfo->ri_onConflict != NULL);
/*
* If the partition's tuple descriptor matches exactly the root
@@ -582,7 +733,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
* need to create state specific to this partition.
*/
if (map == NULL)
- leaf_part_rri->ri_onConflict = resultRelInfo->ri_onConflict;
+ leaf_part_rri->ri_onConflict = rootResultRelInfo->ri_onConflict;
else
{
List *onconflset;
@@ -673,12 +824,9 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
}
}
- Assert(proute->partitions[partidx] == NULL);
- proute->partitions[partidx] = leaf_part_rri;
-
MemoryContextSwitchTo(oldContext);
- return leaf_part_rri;
+ return part_result_rel_index;
}
/*
@@ -693,6 +841,7 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
int partidx)
{
MemoryContext oldContext;
+ TupleConversionMap *map;
/*
* Switch into per-query memory context.
@@ -703,10 +852,24 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
* Set up a tuple conversion map to convert a tuple routed to the
* partition from the parent's type to the partition's.
*/
- proute->parent_child_tupconv_maps[partidx] =
- convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
- RelationGetDescr(partRelInfo->ri_RelationDesc),
- gettext_noop("could not convert row type"));
+ map = convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
+ RelationGetDescr(partRelInfo->ri_RelationDesc),
+ gettext_noop("could not convert row type"));
+
+ if (map)
+ {
+ /* Allocate parent child map array only if we need to store a map */
+ if (proute->parent_child_tupconv_maps == NULL)
+ {
+ int size;
+
+ size = proute->partitions_allocsize;
+ proute->parent_child_tupconv_maps = (TupleConversionMap **)
+ palloc0(sizeof(TupleConversionMap *) * size);
+ }
+
+ proute->parent_child_tupconv_maps[partidx] = map;
+ }
/*
* If the partition is a foreign table, let the FDW init itself for
@@ -721,6 +884,85 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
partRelInfo->ri_PartitionReadyForRouting = true;
}
+/*
+ * ExecInitPartitionDispatchInfo
+ * Initialize PartitionDispatch for a partitioned table
+ *
+ * This also stores it in the proute->partition_dispatch_info array at the
+ * specified index ('partidx'), possibly expanding the array if there isn't
+ * enough space left in it.
+ */
+static PartitionDispatch
+ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
+ PartitionDispatch parent_pd, int partidx)
+{
+ Relation rel;
+ PartitionDesc partdesc;
+ PartitionDispatch pd;
+ int dispatchidx;
+
+ if (partoid != RelationGetRelid(proute->partition_root))
+ rel = heap_open(partoid, NoLock);
+ else
+ rel = proute->partition_root;
+ partdesc = RelationGetPartitionDesc(rel);
+
+ pd = (PartitionDispatch) palloc(offsetof(PartitionDispatchData, indexes)
+ + (partdesc->nparts * sizeof(int)));
+ pd->reldesc = rel;
+ pd->key = RelationGetPartitionKey(rel);
+ pd->keystate = NIL;
+ pd->partdesc = partdesc;
+ if (parent_pd != NULL)
+ {
+ TupleDesc tupdesc = RelationGetDescr(rel);
+
+ /*
+ * For every partitioned table other than the root, we must store a
+ * tuple table slot initialized with its tuple descriptor and a tuple
+ * conversion map to convert a tuple from its parent's rowtype to its
+ * own. That is to make sure that we are looking at the correct row
+ * using the correct tuple descriptor when computing its partition key
+ * for tuple routing.
+ */
+ pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
+ pd->tupmap =
+ convert_tuples_by_name(RelationGetDescr(parent_pd->reldesc),
+ tupdesc,
+ gettext_noop("could not convert row type"));
+ }
+ else
+ {
+ /* Not required for the root partitioned table */
+ pd->tupslot = NULL;
+ pd->tupmap = NULL;
+ }
+
+ /*
+ * Initialize with -1 to signify that the corresponding partition's
+ * ResultRelInfo or PartitionDispatch has not been created yet.
+ */
+ memset(pd->indexes, -1, sizeof(int) * partdesc->nparts);
+
+ dispatchidx = proute->num_dispatch++;
+ if (parent_pd)
+ parent_pd->indexes[partidx] = dispatchidx;
+ if (dispatchidx >= proute->dispatch_allocsize)
+ {
+ /* Expand allocated space. */
+ proute->dispatch_allocsize *= 2;
+ proute->partition_dispatch_info = (PartitionDispatchData **)
+ repalloc(proute->partition_dispatch_info,
+ sizeof(PartitionDispatchData *) *
+ proute->dispatch_allocsize);
+ }
+
+ /* Save here for later use. */
+ proute->partition_dispatch_info[dispatchidx] = pd;
+
+ return pd;
+}
+
/*
* ExecSetupChildParentMapForLeaf -- Initialize the per-leaf-partition
* child-to-root tuple conversion map array.
@@ -733,19 +975,22 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
void
ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute)
{
+ int size;
+
Assert(proute != NULL);
+ size = proute->partitions_allocsize;
+
/*
* These array elements get filled up with maps on an on-demand basis.
* Initially just set all of them to NULL.
*/
proute->child_parent_tupconv_maps =
- (TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) *
- proute->num_partitions);
+ (TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) * size);
/* Same is the case for this array. All the values are set to false */
- proute->child_parent_map_not_required =
- (bool *) palloc0(sizeof(bool) * proute->num_partitions);
+ proute->child_parent_map_not_required = (bool *) palloc0(sizeof(bool) *
+ size);
}
/*
@@ -756,15 +1001,15 @@ TupleConversionMap *
TupConvMapForLeaf(PartitionTupleRouting *proute,
ResultRelInfo *rootRelInfo, int leaf_index)
{
- ResultRelInfo **resultRelInfos = proute->partitions;
TupleConversionMap **map;
TupleDesc tupdesc;
- /* Don't call this if we're not supposed to be using this type of map. */
- Assert(proute->child_parent_tupconv_maps != NULL);
+ /* If nobody else set up the per-leaf maps array, do so ourselves. */
+ if (proute->child_parent_tupconv_maps == NULL)
+ ExecSetupChildParentMapForLeaf(proute);
/* If it's already known that we don't need a map, return NULL. */
- if (proute->child_parent_map_not_required[leaf_index])
+ else if (proute->child_parent_map_not_required[leaf_index])
return NULL;
/* If we've already got a map, return it. */
@@ -773,13 +1018,16 @@ TupConvMapForLeaf(PartitionTupleRouting *proute,
return *map;
/* No map yet; try to create one. */
- tupdesc = RelationGetDescr(resultRelInfos[leaf_index]->ri_RelationDesc);
+ tupdesc = RelationGetDescr(proute->partitions[leaf_index]->ri_RelationDesc);
*map =
convert_tuples_by_name(tupdesc,
RelationGetDescr(rootRelInfo->ri_RelationDesc),
gettext_noop("could not convert row type"));
- /* If it turns out no map is needed, remember for next time. */
+ /*
+ * If it turns out no map is needed, remember that so we don't try making
+ * one again next time.
+ */
proute->child_parent_map_not_required[leaf_index] = (*map == NULL);
return *map;
@@ -827,8 +1075,8 @@ void
ExecCleanupTupleRouting(ModifyTableState *mtstate,
PartitionTupleRouting *proute)
{
+ HTAB *resultrel_hash = proute->subplan_resultrel_hash;
int i;
- int subplan_index = 0;
/*
* Remember, proute->partition_dispatch_info[0] corresponds to the root
@@ -849,10 +1097,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
{
ResultRelInfo *resultRelInfo = proute->partitions[i];
- /* skip further processsing for uninitialized partitions */
- if (resultRelInfo == NULL)
- continue;
-
/* Allow any FDWs to shut down if they've been exercised */
if (resultRelInfo->ri_PartitionReadyForRouting &&
resultRelInfo->ri_FdwRoutine != NULL &&
@@ -861,21 +1105,19 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
resultRelInfo);
/*
- * If this result rel is one of the UPDATE subplan result rels, let
- * ExecEndPlan() close it. For INSERT or COPY,
- * proute->subplan_partition_offsets will always be NULL. Note that
- * the subplan_partition_offsets array and the partitions array have
- * the partitions in the same order. So, while we iterate over
- * partitions array, we also iterate over the
- * subplan_partition_offsets array in order to figure out which of the
- * result rels are present in the UPDATE subplans.
+ * Check if this result rel is one belonging to the node's subplans,
+ * if so, let ExecEndPlan() clean it up.
*/
- if (proute->subplan_partition_offsets &&
- subplan_index < proute->num_subplan_partition_offsets &&
- proute->subplan_partition_offsets[subplan_index] == i)
+ if (resultrel_hash)
{
- subplan_index++;
- continue;
+ Oid partoid;
+ bool found;
+
+ partoid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
+
+ (void) hash_search(resultrel_hash, &partoid, HASH_FIND, &found);
+ if (found)
+ continue;
}
ExecCloseIndices(resultRelInfo);
@@ -889,144 +1131,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
ExecDropSingleTupleTableSlot(proute->partition_tuple_slot);
}
-/*
- * RelationGetPartitionDispatchInfo
- * Returns information necessary to route tuples down a partition tree
- *
- * The number of elements in the returned array (that is, the number of
- * PartitionDispatch objects for the partitioned tables in the partition tree)
- * is returned in *num_parted and a list of the OIDs of all the leaf
- * partitions of rel is returned in *leaf_part_oids.
- *
- * All the relations in the partition tree (including 'rel') must have been
- * locked (using at least the AccessShareLock) by the caller.
- */
-static PartitionDispatch *
-RelationGetPartitionDispatchInfo(Relation rel,
- int *num_parted, List **leaf_part_oids)
-{
- List *pdlist = NIL;
- PartitionDispatchData **pd;
- ListCell *lc;
- int i;
-
- Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
-
- *num_parted = 0;
- *leaf_part_oids = NIL;
-
- get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
- *num_parted = list_length(pdlist);
- pd = (PartitionDispatchData **) palloc(*num_parted *
- sizeof(PartitionDispatchData *));
- i = 0;
- foreach(lc, pdlist)
- {
- pd[i++] = lfirst(lc);
- }
-
- return pd;
-}
-
-/*
- * get_partition_dispatch_recurse
- * Recursively expand partition tree rooted at rel
- *
- * As the partition tree is expanded in a depth-first manner, we maintain two
- * global lists: of PartitionDispatch objects corresponding to partitioned
- * tables in *pds and of the leaf partition OIDs in *leaf_part_oids.
- *
- * Note that the order of OIDs of leaf partitions in leaf_part_oids matches
- * the order in which the planner's expand_partitioned_rtentry() processes
- * them. It's not necessarily the case that the offsets match up exactly,
- * because constraint exclusion might prune away some partitions on the
- * planner side, whereas we'll always have the complete list; but unpruned
- * partitions will appear in the same order in the plan as they are returned
- * here.
- */
-static void
-get_partition_dispatch_recurse(Relation rel, Relation parent,
- List **pds, List **leaf_part_oids)
-{
- TupleDesc tupdesc = RelationGetDescr(rel);
- PartitionDesc partdesc = RelationGetPartitionDesc(rel);
- PartitionKey partkey = RelationGetPartitionKey(rel);
- PartitionDispatch pd;
- int i;
-
- check_stack_depth();
-
- /* Build a PartitionDispatch for this table and add it to *pds. */
- pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
- *pds = lappend(*pds, pd);
- pd->reldesc = rel;
- pd->key = partkey;
- pd->keystate = NIL;
- pd->partdesc = partdesc;
- if (parent != NULL)
- {
- /*
- * For every partitioned table other than the root, we must store a
- * tuple table slot initialized with its tuple descriptor and a tuple
- * conversion map to convert a tuple from its parent's rowtype to its
- * own. That is to make sure that we are looking at the correct row
- * using the correct tuple descriptor when computing its partition key
- * for tuple routing.
- */
- pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
- pd->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
- tupdesc,
- gettext_noop("could not convert row type"));
- }
- else
- {
- /* Not required for the root partitioned table */
- pd->tupslot = NULL;
- pd->tupmap = NULL;
- }
-
- /*
- * Go look at each partition of this table. If it's a leaf partition,
- * simply add its OID to *leaf_part_oids. If it's a partitioned table,
- * recursively call get_partition_dispatch_recurse(), so that its
- * partitions are processed as well and a corresponding PartitionDispatch
- * object gets added to *pds.
- *
- * The 'indexes' array is used when searching for a partition matching a
- * given tuple. The actual value we store here depends on whether the
- * array element belongs to a leaf partition or a subpartitioned table.
- * For leaf partitions we store the index into *leaf_part_oids, and for
- * sub-partitioned tables we store a negative version of the index into
- * the *pds list. Both indexes are 0-based, but the first element of the
- * *pds list is the root partition, so 0 always means the first leaf. When
- * searching, if we see a negative value, the search must continue in the
- * corresponding sub-partition; otherwise, we've identified the correct
- * partition.
- */
- pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
- for (i = 0; i < partdesc->nparts; i++)
- {
- Oid partrelid = partdesc->oids[i];
-
- if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
- {
- *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
- pd->indexes[i] = list_length(*leaf_part_oids) - 1;
- }
- else
- {
- /*
- * We assume all tables in the partition tree were already locked
- * by the caller.
- */
- Relation partrel = heap_open(partrelid, NoLock);
-
- pd->indexes[i] = -list_length(*pds);
- get_partition_dispatch_recurse(partrel, rel, pds, leaf_part_oids);
- }
- }
-}
-
/* ----------------
* FormPartitionKeyDatum
* Construct values[] and isnull[] arrays for the partition key
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index d8d89c7983..bbffbd722e 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -68,7 +68,6 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
ResultRelInfo *targetRelInfo,
TupleTableSlot *slot);
static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
-static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
int whichplan);
@@ -1667,7 +1666,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
if (mtstate->mt_transition_capture != NULL ||
mtstate->mt_oc_transition_capture != NULL)
{
- ExecSetupChildParentMapForTcs(mtstate);
+ ExecSetupChildParentMapForSubplan(mtstate);
/*
* Install the conversion map for the first plan for UPDATE and DELETE
@@ -1710,21 +1709,13 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
* value is to be used as an index into the arrays for the ResultRelInfo
* and TupleConversionMap for the partition.
*/
- partidx = ExecFindPartition(targetRelInfo,
- proute->partition_dispatch_info,
- slot,
- estate);
+ partidx = ExecFindPartition(mtstate, targetRelInfo, proute, slot, estate);
Assert(partidx >= 0 && partidx < proute->num_partitions);
- /*
- * Get the ResultRelInfo corresponding to the selected partition; if not
- * yet there, initialize it.
- */
+ Assert(proute->partitions[partidx] != NULL);
+ /* Get the ResultRelInfo corresponding to the selected partition. */
partrel = proute->partitions[partidx];
- if (partrel == NULL)
- partrel = ExecInitPartitionInfo(mtstate, targetRelInfo,
- proute, estate,
- partidx);
+ Assert(partrel != NULL);
/*
* Check whether the partition is routable if we didn't yet
@@ -1790,11 +1781,10 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
/*
* Convert the tuple, if necessary.
*/
- ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[partidx],
- tuple,
- proute->partition_tuple_slot,
- &slot,
- true);
+ if (proute->parent_child_tupconv_maps)
+ ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[partidx],
+ tuple, proute->partition_tuple_slot, &slot,
+ true);
/* Initialize information needed to handle ON CONFLICT DO UPDATE. */
Assert(mtstate != NULL);
@@ -1830,17 +1820,6 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
int numResultRelInfos = mtstate->mt_nplans;
int i;
- /*
- * First check if there is already a per-subplan array allocated. Even if
- * there is already a per-leaf map array, we won't require a per-subplan
- * one, since we will use the subplan offset array to convert the subplan
- * index to per-leaf index.
- */
- if (mtstate->mt_per_subplan_tupconv_maps ||
- (mtstate->mt_partition_tuple_routing &&
- mtstate->mt_partition_tuple_routing->child_parent_tupconv_maps))
- return;
-
/*
* Build array of conversion maps from each child's TupleDesc to the one
* used in the target relation. The map pointers may be NULL when no
@@ -1862,79 +1841,18 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
}
}
-/*
- * Initialize the child-to-root tuple conversion map array required for
- * capturing transition tuples.
- *
- * The map array can be indexed either by subplan index or by leaf-partition
- * index. For transition tables, we need a subplan-indexed access to the map,
- * and where tuple-routing is present, we also require a leaf-indexed access.
- */
-static void
-ExecSetupChildParentMapForTcs(ModifyTableState *mtstate)
-{
- PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
- /*
- * If partition tuple routing is set up, we will require partition-indexed
- * access. In that case, create the map array indexed by partition; we
- * will still be able to access the maps using a subplan index by
- * converting the subplan index to a partition index using
- * subplan_partition_offsets. If tuple routing is not set up, it means we
- * don't require partition-indexed access. In that case, create just a
- * subplan-indexed map.
- */
- if (proute)
- {
- /*
- * If a partition-indexed map array is to be created, the subplan map
- * array has to be NULL. If the subplan map array is already created,
- * we won't be able to access the map using a partition index.
- */
- Assert(mtstate->mt_per_subplan_tupconv_maps == NULL);
-
- ExecSetupChildParentMapForLeaf(proute);
- }
- else
- ExecSetupChildParentMapForSubplan(mtstate);
-}
-
/*
* For a given subplan index, get the tuple conversion map.
*/
static TupleConversionMap *
tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
{
- /*
- * If a partition-index tuple conversion map array is allocated, we need
- * to first get the index into the partition array. Exactly *one* of the
- * two arrays is allocated. This is because if there is a partition array
- * required, we don't require subplan-indexed array since we can translate
- * subplan index into partition index. And, we create a subplan-indexed
- * array *only* if partition-indexed array is not required.
- */
+ /* If nobody else set the per-subplan array of maps, do so ourselves. */
if (mtstate->mt_per_subplan_tupconv_maps == NULL)
- {
- int leaf_index;
- PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
- /*
- * If subplan-indexed array is NULL, things should have been arranged
- * to convert the subplan index to partition index.
- */
- Assert(proute && proute->subplan_partition_offsets != NULL &&
- whichplan < proute->num_subplan_partition_offsets);
-
- leaf_index = proute->subplan_partition_offsets[whichplan];
+ ExecSetupChildParentMapForSubplan(mtstate);
- return TupConvMapForLeaf(proute, getTargetResultRelInfo(mtstate),
- leaf_index);
- }
- else
- {
- Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
- return mtstate->mt_per_subplan_tupconv_maps[whichplan];
- }
+ Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+ return mtstate->mt_per_subplan_tupconv_maps[whichplan];
}
/* ----------------------------------------------------------------
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 115a9fe78f..82acfeb460 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -594,6 +594,7 @@ RelationBuildPartitionDesc(Relation rel)
int next_index = 0;
result->oids = (Oid *) palloc0(nparts * sizeof(Oid));
+ result->is_leaf = (bool *) palloc(nparts * sizeof(bool));
boundinfo = (PartitionBoundInfoData *)
palloc0(sizeof(PartitionBoundInfoData));
@@ -782,7 +783,15 @@ RelationBuildPartitionDesc(Relation rel)
* defined by canonicalized representation of the partition bounds.
*/
for (i = 0; i < nparts; i++)
- result->oids[mapping[i]] = oids[i];
+ {
+ int index = mapping[i];
+
+ result->oids[index] = oids[i];
+ /* Record if the partition is a leaf partition */
+ result->is_leaf[index] =
+ (get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
+ }
+
pfree(mapping);
}
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 1f49e5d3a9..4b3b5ae770 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -26,7 +26,10 @@
typedef struct PartitionDescData
{
int nparts; /* Number of partitions */
- Oid *oids; /* OIDs of partitions */
+ Oid *oids; /* Array of length 'nparts' containing
+ * partition OIDs in order of their bounds */
+ bool *is_leaf; /* Array of 'nparts' elements storing whether
+ * a partition is a leaf partition or not */
PartitionBoundInfo boundinfo; /* collection of partition bounds */
} PartitionDescData;
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index f6cd842cc9..0b03b9dd76 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -31,9 +31,13 @@
* tupmap TupleConversionMap to convert from the parent's rowtype to
* this table's rowtype (when extracting the partition key of a
* tuple just before routing it through this table)
- * indexes Array with partdesc->nparts members (for details on what
- * individual members represent, see how they are set in
- * get_partition_dispatch_recurse())
+ * indexes Array with partdesc->nparts elements. For leaf partitions the
+ * index into the PartitionTupleRouting->partitions array is
+ * stored. When the partition is itself a partitioned table then
+ * we store the index into
+ * PartitionTupleRouting->partition_dispatch_info. -1 means
+ * we've not yet allocated anything in PartitionTupleRouting for
+ * the partition.
*-----------------------
*/
typedef struct PartitionDispatchData
@@ -44,72 +48,114 @@ typedef struct PartitionDispatchData
PartitionDesc partdesc;
TupleTableSlot *tupslot;
TupleConversionMap *tupmap;
- int *indexes;
+ int indexes[FLEXIBLE_ARRAY_MEMBER];
} PartitionDispatchData;
typedef struct PartitionDispatchData *PartitionDispatch;
/*-----------------------
- * PartitionTupleRouting - Encapsulates all information required to execute
- * tuple-routing between partitions.
- *
- * partition_dispatch_info Array of PartitionDispatch objects with one
- * entry for every partitioned table in the
- * partition tree.
- * num_dispatch number of partitioned tables in the partition
- * tree (= length of partition_dispatch_info[])
- * partition_oids Array of leaf partitions OIDs with one entry
- * for every leaf partition in the partition tree,
- * initialized in full by
- * ExecSetupPartitionTupleRouting.
- * partitions Array of ResultRelInfo* objects with one entry
- * for every leaf partition in the partition tree,
- * initialized lazily by ExecInitPartitionInfo.
- * num_partitions Number of leaf partitions in the partition tree
- * (= 'partitions_oid'/'partitions' array length)
- * parent_child_tupconv_maps Array of TupleConversionMap objects with one
- * entry for every leaf partition (required to
- * convert tuple from the root table's rowtype to
- * a leaf partition's rowtype after tuple routing
- * is done)
- * child_parent_tupconv_maps Array of TupleConversionMap objects with one
- * entry for every leaf partition (required to
- * convert an updated tuple from the leaf
- * partition's rowtype to the root table's rowtype
- * so that tuple routing can be done)
- * child_parent_map_not_required Array of bool. True value means that a map is
- * determined to be not required for the given
- * partition. False means either we haven't yet
- * checked if a map is required, or it was
- * determined to be required.
- * subplan_partition_offsets Integer array ordered by UPDATE subplans. Each
- * element of this array has the index into the
- * corresponding partition in partitions array.
- * num_subplan_partition_offsets Length of 'subplan_partition_offsets' array
- * partition_tuple_slot TupleTableSlot to be used to manipulate any
- * given leaf partition's rowtype after that
- * partition is chosen for insertion by
- * tuple-routing.
- * root_tuple_slot TupleTableSlot to be used to transiently hold
- * copy of a tuple that's being moved across
- * partitions in the root partitioned table's
- * rowtype
+ * PartitionTupleRouting - Encapsulates all information required to
+ * route a tuple inserted into a partitioned table to one of its leaf
+ * partitions
+ *
+ * partition_root The partitioned table that's the target of the
+ * command.
+ *
+ * partition_dispatch_info Array of 'dispatch_allocsize' elements containing
+ * a pointer to a PartitionDispatch objects for every
+ * partitioned table touched by tuple routing. The
+ * entry for the target partitioned table is *always*
+ * present as the first entry of this array. See
+ * comment for PartitionDispatchData->indexes for
+ * details on how this array is indexed.
+ *
+ * num_dispatch The current number of items stored in the
+ * 'partition_dispatch_info' array. Also serves as
+ * the index of the next free array element for new
+ * PartitionDispatch objects which need to be stored.
+ *
+ * dispatch_allocsize The current allocated size of the
+ * 'partition_dispatch_info' array.
+ *
+ * partitions Array of 'partitions_allocsize' elements
+ * containing pointers to ResultRelInfos of all
+ * leaf partitions touched by tuple routing. Some of
+ * these are pointers to ResultRelInfos which are
+ * borrowed out of 'subplan_resultrel_hash'. The
+ * remainder have been built especially for tuple
+ * routing. See comment for
+ * PartitionDispatchData->indexes for details on how
+ * this array is indexed.
+ *
+ * num_partitions The current number of items stored in the
+ * 'partitions' array. Also serves as the index of
+ * the next free array element for new ResultRelInfos
+ * which need to be stored.
+ *
+ * partitions_allocsize The current allocated size of the 'partitions'
+ * array. Also, if they're non-NULL, marks the size
+ * of the 'parent_child_tupconv_maps',
+ * 'child_parent_tupconv_maps' and
+ * 'child_parent_map_not_required' arrays.
+ *
+ * parent_child_tupconv_maps Array of partitions_allocsize elements
+ * containing information on how to convert tuples of
+ * partition_root's rowtype to the rowtype of the
+ * corresponding partition as stored in 'partitions',
+ * or NULL if no conversion is required. The entire
+ * array is only allocated when the first conversion
+ * map needs to be stored. When not allocated it's set
+ * to NULL.
+ *
+ * partition_tuple_slot This is a tuple slot used to store a tuple using
+ * the rowtype of the partition chosen by tuple
+ * routing. Maintained separately because partitions
+ * may have different rowtypes.
+ *
+ * Note: The following fields are used only when UPDATE ends up needing to
+ * do tuple routing.
+ *
+ * child_parent_tupconv_maps As 'parent_child_tupconv_maps' but stores
+ * conversion maps to translate partition tuples into
+ * partition_root's rowtype.
+ *
+ * child_parent_map_not_required True if the corresponding
+ * child_parent_tupconv_maps element has been
+ * determined to require no translation or set to
+ * NULL when child_parent_tupconv_maps is NULL. This
+ * is required in order to distinguish tuple
+ * translations which have been seen to not be
+ * required due to the TupleDescs being compatible
+ * with translations which have yet to be determined.
+ *
+ * subplan_resultrel_hash Hash table to store subplan ResultRelInfos by Oid.
+ * This is used to cache ResultRelInfos from subplans
+ * of a ModifyTable node. Some of these may be
+ * useful for tuple routing to save having to build
+ * duplicates.
+ *
+ * root_tuple_slot During UPDATE tuple routing, this tuple slot is
+ * used to transiently store a tuple using the root
+ * table's rowtype after converting it from the
+ * tuple's source leaf partition's rowtype. That is,
+ * if the leaf partition's rowtype is different.
*-----------------------
*/
typedef struct PartitionTupleRouting
{
+ Relation partition_root;
PartitionDispatch *partition_dispatch_info;
int num_dispatch;
- Oid *partition_oids;
+ int dispatch_allocsize;
ResultRelInfo **partitions;
int num_partitions;
+ int partitions_allocsize;
TupleConversionMap **parent_child_tupconv_maps;
TupleConversionMap **child_parent_tupconv_maps;
bool *child_parent_map_not_required;
- int *subplan_partition_offsets;
- int num_subplan_partition_offsets;
- TupleTableSlot *partition_tuple_slot;
+ HTAB *subplan_resultrel_hash;
TupleTableSlot *root_tuple_slot;
+ TupleTableSlot *partition_tuple_slot;
} PartitionTupleRouting;
/*
@@ -200,14 +246,15 @@ typedef struct PartitionPruneState
extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
Relation rel);
-extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
- PartitionDispatch *pd,
+extern int ExecFindPartition(ModifyTableState *mtstate,
+ ResultRelInfo *resultRelInfo,
+ PartitionTupleRouting *proute,
TupleTableSlot *slot,
EState *estate);
-extern ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
- ResultRelInfo *resultRelInfo,
- PartitionTupleRouting *proute,
- EState *estate, int partidx);
+extern ResultRelInfo *ExecGetPartitionInfo(ModifyTableState *mtstate,
+ ResultRelInfo *resultRelInfo,
+ PartitionTupleRouting *proute,
+ EState *estate, int partidx);
extern void ExecInitRoutingInfo(ModifyTableState *mtstate,
EState *estate,
PartitionTupleRouting *proute,
--
2.16.2.windows.1
Attachment: v1-0002-Store-partition-details-in-pg_partition-instead-o.patch (application/octet-stream)
From 99c32c8bb1d424aecdf24fcb0e96a470c91d83a0 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Thu, 2 Aug 2018 16:02:42 +1200
Subject: [PATCH v1 2/4] Store partition details in pg_partition instead of
pg_inherits
---
contrib/postgres_fdw/postgres_fdw.c | 4 +-
src/backend/catalog/Makefile | 2 +-
src/backend/catalog/heap.c | 72 ++---
src/backend/catalog/index.c | 2 +-
src/backend/catalog/partition.c | 140 +++++----
src/backend/catalog/pg_inherits.c | 6 +
src/backend/commands/analyze.c | 12 +-
src/backend/commands/lockcmds.c | 127 +++++---
src/backend/commands/publicationcmds.c | 10 +-
src/backend/commands/tablecmds.c | 477 +++++++++++++++++++++--------
src/backend/commands/trigger.c | 10 +-
src/backend/commands/vacuum.c | 3 +-
src/backend/executor/execPartition.c | 3 +-
src/backend/optimizer/prep/prepunion.c | 215 ++++++++-----
src/backend/partitioning/partbounds.c | 11 +-
src/backend/rewrite/rewriteDefine.c | 2 +-
src/backend/tcop/utility.c | 9 +-
src/backend/utils/cache/partcache.c | 70 +++--
src/backend/utils/cache/syscache.c | 14 +-
src/bin/pg_dump/common.c | 22 ++
src/bin/pg_dump/pg_dump.c | 30 +-
src/bin/pg_dump/pg_dump.h | 1 +
src/bin/psql/describe.c | 60 +++-
src/bin/psql/tab-complete.c | 1 +
src/include/catalog/heap.h | 4 +-
src/include/catalog/indexing.h | 6 +
src/include/catalog/partition.h | 2 +
src/include/catalog/pg_class.dat | 18 +-
src/include/catalog/pg_class.h | 3 +-
src/include/catalog/toasting.h | 1 +
src/include/nodes/parsenodes.h | 2 +-
src/include/utils/rel.h | 6 +
src/include/utils/syscache.h | 1 +
src/test/regress/expected/alter_table.out | 4 +-
src/test/regress/expected/misc_sanity.out | 3 +-
src/test/regress/expected/sanity_check.out | 1 +
36 files changed, 907 insertions(+), 447 deletions(-)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 0803c30a48..beb867d613 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -4522,7 +4522,9 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
deparseStringLiteral(&buf, stmt->remote_schema);
/* Partitions are supported since Postgres 10 */
- if (PQserverVersion(conn) >= 100000)
+ if (PQserverVersion(conn) >= 120000)
+ appendStringInfoString(&buf, " AND c.relpartitionparent <> 0 ");
+ else if (PQserverVersion(conn) >= 100000)
appendStringInfoString(&buf, " AND NOT c.relispartition ");
/* Apply restrictions for LIMIT TO and EXCEPT */
diff --git a/src/backend/catalog/Makefile b/src/backend/catalog/Makefile
index 0865240f11..43d9e2eaaa 100644
--- a/src/backend/catalog/Makefile
+++ b/src/backend/catalog/Makefile
@@ -46,7 +46,7 @@ CATALOG_HEADERS := \
pg_default_acl.h pg_init_privs.h pg_seclabel.h pg_shseclabel.h \
pg_collation.h pg_partitioned_table.h pg_range.h pg_transform.h \
pg_sequence.h pg_publication.h pg_publication_rel.h pg_subscription.h \
- pg_subscription_rel.h
+ pg_subscription_rel.h pg_partition.h
GENERATED_HEADERS := $(CATALOG_HEADERS:%.h=%_d.h) schemapg.h
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 4cfc0c8911..c6429a3785 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -49,6 +49,7 @@
#include "catalog/pg_inherits.h"
#include "catalog/pg_namespace.h"
#include "catalog/pg_opclass.h"
+#include "catalog/pg_partition.h"
#include "catalog/pg_partitioned_table.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_subscription_rel.h"
@@ -810,7 +811,7 @@ InsertPgClassTuple(Relation pg_class_desc,
values[Anum_pg_class_relhassubclass - 1] = BoolGetDatum(rd_rel->relhassubclass);
values[Anum_pg_class_relispopulated - 1] = BoolGetDatum(rd_rel->relispopulated);
values[Anum_pg_class_relreplident - 1] = CharGetDatum(rd_rel->relreplident);
- values[Anum_pg_class_relispartition - 1] = BoolGetDatum(rd_rel->relispartition);
+ values[Anum_pg_class_relpartitionparent - 1] = ObjectIdGetDatum(rd_rel->relpartitionparent);
values[Anum_pg_class_relrewrite - 1] = ObjectIdGetDatum(rd_rel->relrewrite);
values[Anum_pg_class_relfrozenxid - 1] = TransactionIdGetDatum(rd_rel->relfrozenxid);
values[Anum_pg_class_relminmxid - 1] = MultiXactIdGetDatum(rd_rel->relminmxid);
@@ -823,9 +824,6 @@ InsertPgClassTuple(Relation pg_class_desc,
else
nulls[Anum_pg_class_reloptions - 1] = true;
- /* relpartbound is set by updating this tuple, if necessary */
- nulls[Anum_pg_class_relpartbound - 1] = true;
-
tup = heap_form_tuple(RelationGetDescr(pg_class_desc), values, nulls);
/*
@@ -929,8 +927,8 @@ AddNewRelationTuple(Relation pg_class_desc,
new_rel_reltup->reltype = new_type_oid;
new_rel_reltup->reloftype = reloftype;
- /* relispartition is always set by updating this tuple later */
- new_rel_reltup->relispartition = false;
+	/* relpartitionparent is updated later, if this rel is a partition */
+ new_rel_reltup->relpartitionparent = InvalidOid;
new_rel_desc->rd_att->tdtypeid = new_type_oid;
@@ -1807,11 +1805,17 @@ heap_drop_with_catalog(Oid relid)
tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
if (!HeapTupleIsValid(tuple))
elog(ERROR, "cache lookup failed for relation %u", relid);
- if (((Form_pg_class) GETSTRUCT(tuple))->relispartition)
+
+ parentOid = ((Form_pg_class) GETSTRUCT(tuple))->relpartitionparent;
+ if (OidIsValid(parentOid))
{
- parentOid = get_partition_parent(relid);
+ HeapTuple parttup;
+ Relation pgpart;
+
LockRelationOid(parentOid, AccessExclusiveLock);
+ pgpart = heap_open(PartitionRelationId, RowExclusiveLock);
+
/*
* If this is not the default partition, dropping it will change the
* default partition's partition constraint, so we must lock it.
@@ -1819,6 +1823,14 @@ heap_drop_with_catalog(Oid relid)
defaultPartOid = get_default_partition_oid(parentOid);
if (OidIsValid(defaultPartOid) && relid != defaultPartOid)
LockRelationOid(defaultPartOid, AccessExclusiveLock);
+
+ parttup = SearchSysCacheCopy1(PARTSRELID, ObjectIdGetDatum(relid));
+ if (!HeapTupleIsValid(parttup))
+ elog(ERROR, "cache lookup failed for relation %u", relid);
+
+ CatalogTupleDelete(pgpart, &parttup->t_self);
+
+ heap_close(pgpart, RowExclusiveLock);
}
ReleaseSysCache(tuple);
@@ -2759,7 +2771,8 @@ MergeWithExistingConstraint(Relation rel, const char *ccname, Node *expr,
* constraints are always non-local, including those that were
* merged.
*/
- if (is_local && !con->conislocal && !rel->rd_rel->relispartition)
+ if (is_local && !con->conislocal &&
+ !OidIsValid(rel->rd_rel->relpartitionparent))
allow_merge = true;
if (!found || !allow_merge)
@@ -2809,7 +2822,7 @@ MergeWithExistingConstraint(Relation rel, const char *ccname, Node *expr,
* inherited only once since it cannot have multiple parents and
* it is never considered local.
*/
- if (rel->rd_rel->relispartition)
+ if (OidIsValid(rel->rd_rel->relpartitionparent))
{
con->coninhcount = 1;
con->conislocal = false;
@@ -3481,9 +3494,9 @@ RemovePartitionKeyByRelId(Oid relid)
}
/*
- * StorePartitionBound
- * Update pg_class tuple of rel to store the partition bound and set
- * relispartition to true
+ * MarkRelationPartitioned
+ * Update pg_class tuple of rel to set relpartitionparent to the parent's
+ * Oid.
*
* If this is the default partition, also update the default partition OID in
* pg_partitioned_table.
@@ -3493,14 +3506,10 @@ RemovePartitionKeyByRelId(Oid relid)
* default partition, we must invalidate its relcache entry as well.
*/
void
-StorePartitionBound(Relation rel, Relation parent, PartitionBoundSpec *bound)
+MarkRelationPartitioned(Relation rel, Relation parent, bool is_default)
{
Relation classRel;
- HeapTuple tuple,
- newtuple;
- Datum new_val[Natts_pg_class];
- bool new_null[Natts_pg_class],
- new_repl[Natts_pg_class];
+ HeapTuple tuple;
Oid defaultPartOid;
/* Update pg_class tuple */
@@ -3514,36 +3523,23 @@ StorePartitionBound(Relation rel, Relation parent, PartitionBoundSpec *bound)
#ifdef USE_ASSERT_CHECKING
{
Form_pg_class classForm;
- bool isnull;
classForm = (Form_pg_class) GETSTRUCT(tuple);
- Assert(!classForm->relispartition);
- (void) SysCacheGetAttr(RELOID, tuple, Anum_pg_class_relpartbound,
- &isnull);
- Assert(isnull);
+ Assert(!OidIsValid(classForm->relpartitionparent));
}
#endif
- /* Fill in relpartbound value */
- memset(new_val, 0, sizeof(new_val));
- memset(new_null, false, sizeof(new_null));
- memset(new_repl, false, sizeof(new_repl));
- new_val[Anum_pg_class_relpartbound - 1] = CStringGetTextDatum(nodeToString(bound));
- new_null[Anum_pg_class_relpartbound - 1] = false;
- new_repl[Anum_pg_class_relpartbound - 1] = true;
- newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
- new_val, new_null, new_repl);
- /* Also set the flag */
- ((Form_pg_class) GETSTRUCT(newtuple))->relispartition = true;
- CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
- heap_freetuple(newtuple);
+ /* Set the relpartitionparent */
+ ((Form_pg_class) GETSTRUCT(tuple))->relpartitionparent = RelationGetRelid(parent);
+ CatalogTupleUpdate(classRel, &tuple->t_self, tuple);
+ heap_freetuple(tuple);
heap_close(classRel, RowExclusiveLock);
/*
* If we're storing bounds for the default partition, update
* pg_partitioned_table too.
*/
- if (bound->is_default)
+ if (is_default)
update_default_partition_oid(RelationGetRelid(parent),
RelationGetRelid(rel));
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8b276bc430..eda850edef 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -936,7 +936,7 @@ index_create(Relation heapRelation,
indexRelation->rd_rel->relowner = heapRelation->rd_rel->relowner;
indexRelation->rd_rel->relam = accessMethodObjectId;
indexRelation->rd_rel->relhasoids = false;
- indexRelation->rd_rel->relispartition = OidIsValid(parentIndexRelid);
+ indexRelation->rd_rel->relpartitionparent = parentIndexRelid;
/*
* store index's pg_class entry
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 558022647c..d3e6787885 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -22,6 +22,7 @@
#include "catalog/indexing.h"
#include "catalog/partition.h"
#include "catalog/pg_inherits.h"
+#include "catalog/pg_partition.h"
#include "catalog/pg_partitioned_table.h"
#include "nodes/makefuncs.h"
#include "optimizer/clauses.h"
@@ -29,124 +30,121 @@
#include "optimizer/var.h"
#include "partitioning/partbounds.h"
#include "rewrite/rewriteManip.h"
+#include "storage/lmgr.h"
#include "utils/fmgroids.h"
#include "utils/partcache.h"
#include "utils/rel.h"
#include "utils/syscache.h"
-
-static Oid get_partition_parent_worker(Relation inhRel, Oid relid);
-static void get_partition_ancestors_worker(Relation inhRel, Oid relid,
- List **ancestors);
+static void get_partition_descendants_worker(Oid relid, LOCKMODE lockmode,
+ List **reloids);
/*
* get_partition_parent
* Obtain direct parent of given relation
*
- * Returns inheritance parent of a partition by scanning pg_inherits
- *
- * Note: Because this function assumes that the relation whose OID is passed
- * as an argument will have precisely one parent, it should only be called
- * when it is known that the relation is a partition.
+ * Returns the partition parent of a partition, or InvalidOid if there is no parent
*/
Oid
get_partition_parent(Oid relid)
{
- Relation catalogRelation;
Oid result;
+ HeapTuple tuple;
- catalogRelation = heap_open(InheritsRelationId, AccessShareLock);
-
- result = get_partition_parent_worker(catalogRelation, relid);
-
- if (!OidIsValid(result))
- elog(ERROR, "could not find tuple for parent of relation %u", relid);
+	tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", relid);
- heap_close(catalogRelation, AccessShareLock);
+ result = ((Form_pg_class) GETSTRUCT(tuple))->relpartitionparent;
+ ReleaseSysCache(tuple);
return result;
}
/*
- * get_partition_parent_worker
- * Scan the pg_inherits relation to return the OID of the parent of the
- * given relation
+ * get_partition_ancestors
+ * Obtain ancestors of a given partition or partitioned index.
+ *
+ * Follows pg_class.relpartitionparent links and returns a list of ancestors
+ * of the given partition or partitioned index starting with the parent and
+ * ending with the top-level partitioned table or index.
*/
-static Oid
-get_partition_parent_worker(Relation inhRel, Oid relid)
+List *
+get_partition_ancestors(Oid relid)
{
- SysScanDesc scan;
- ScanKeyData key[2];
- Oid result = InvalidOid;
+ List *result = NIL;
HeapTuple tuple;
- ScanKeyInit(&key[0],
- Anum_pg_inherits_inhrelid,
- BTEqualStrategyNumber, F_OIDEQ,
- ObjectIdGetDatum(relid));
- ScanKeyInit(&key[1],
- Anum_pg_inherits_inhseqno,
- BTEqualStrategyNumber, F_INT4EQ,
- Int32GetDatum(1));
-
- scan = systable_beginscan(inhRel, InheritsRelidSeqnoIndexId, true,
- NULL, 2, key);
- tuple = systable_getnext(scan);
- if (HeapTupleIsValid(tuple))
+ for (;;)
{
- Form_pg_inherits form = (Form_pg_inherits) GETSTRUCT(tuple);
+ Oid parentOid;
- result = form->inhparent;
- }
+		tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", relid);
+
+ parentOid = ((Form_pg_class) GETSTRUCT(tuple))->relpartitionparent;
+ ReleaseSysCache(tuple);
- systable_endscan(scan);
+ if (!OidIsValid(parentOid))
+ break;
+
+ result = lappend_oid(result, parentOid);
+
+ relid = parentOid;
+ }
return result;
}
/*
- * get_partition_ancestors
- * Obtain ancestors of given relation
- *
- * Returns a list of ancestors of the given relation.
- *
- * Note: Because this function assumes that the relation whose OID is passed
- * as an argument and each ancestor will have precisely one parent, it should
- * only be called when it is known that the relation is a partition.
+ * get_partition_descendants
+ *		Returns a list of the OIDs of all partitions that descend from 'relid',
+ *		including 'relid' itself.  Obtains a 'lockmode' level lock on each
+ * item in the list.
*/
List *
-get_partition_ancestors(Oid relid)
+get_partition_descendants(Oid relid, LOCKMODE lockmode)
{
- List *result = NIL;
- Relation inhRel;
-
- inhRel = heap_open(InheritsRelationId, AccessShareLock);
-
- get_partition_ancestors_worker(inhRel, relid, &result);
+ List *reloids = NIL;
- heap_close(inhRel, AccessShareLock);
+ get_partition_descendants_worker(relid, lockmode, &reloids);
- return result;
+ return reloids;
}
-/*
- * get_partition_ancestors_worker
- * recursive worker for get_partition_ancestors
- */
static void
-get_partition_ancestors_worker(Relation inhRel, Oid relid, List **ancestors)
+get_partition_descendants_worker(Oid relid, LOCKMODE lockmode, List **reloids)
{
- Oid parentOid;
+ Relation rel = relation_open(relid, lockmode);
+ PartitionDesc partdesc;
+ int i;
+
+ partdesc = RelationGetPartitionDesc(rel);
+
+ Assert(partdesc);
- /* Recursion ends at the topmost level, ie., when there's no parent */
- parentOid = get_partition_parent_worker(inhRel, relid);
- if (parentOid == InvalidOid)
- return;
+ *reloids = lappend_oid(*reloids, relid);
- *ancestors = lappend_oid(*ancestors, parentOid);
- get_partition_ancestors_worker(inhRel, parentOid, ancestors);
+ for (i = 0; i < partdesc->nparts; i++)
+ {
+ Oid partoid = partdesc->oids[i];
+
+ if (!partdesc->is_leaf[i])
+ get_partition_descendants_worker(partoid, lockmode, reloids);
+ else
+ {
+ if (lockmode != NoLock)
+ LockRelationOid(partoid, lockmode);
+
+ *reloids = lappend_oid(*reloids, partoid);
+ }
+ }
+
+ relation_close(rel, NoLock);
}
+
/*
* map_partition_varattnos - maps varattno of any Vars in expr from the
* attno's of 'from_rel' to the attno's of 'to_rel' partition, each of which
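As an aside for reviewers: the rewritten get_partition_ancestors() above no longer scans pg_inherits; it just follows pg_class.relpartitionparent links iteratively until it hits InvalidOid. The following is a minimal, self-contained C sketch of that loop. Everything here (the `Oid` typedef, the `parent_of[]` array standing in for the syscache lookup, and the `toy_get_ancestors` name) is invented for illustration and is not the patch's actual code:

```c
#include <assert.h>

/* Toy stand-ins: in the real code the parent OID comes from the
 * relation's pg_class tuple via SearchSysCache1(RELOID, ...). */
typedef unsigned int Oid;

#define MAX_RELS 8

/* parent_of[relid] = parent's OID; 0 plays the role of InvalidOid */
static Oid parent_of[MAX_RELS];

/*
 * Walk the parent links from 'relid' up to the top-level table, storing
 * each ancestor in 'out' (nearest parent first).  Returns the number of
 * ancestors found.  This mirrors the iterative loop in the rewritten
 * get_partition_ancestors().
 */
static int
toy_get_ancestors(Oid relid, Oid *out)
{
	int			n = 0;

	for (;;)
	{
		Oid			parent = parent_of[relid];

		if (parent == 0)		/* reached a non-partition; stop */
			break;

		out[n++] = parent;
		relid = parent;
	}
	return n;
}
```

For a three-level hierarchy 1 -> 2 -> 3, toy_get_ancestors(3, ...) yields {2, 1}: each step is a single direct lookup, which is why this is cheaper than the old recursive pg_inherits scans.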
diff --git a/src/backend/catalog/pg_inherits.c b/src/backend/catalog/pg_inherits.c
index 85baca54cc..87c8fd3266 100644
--- a/src/backend/catalog/pg_inherits.c
+++ b/src/backend/catalog/pg_inherits.c
@@ -28,6 +28,7 @@
#include "storage/lmgr.h"
#include "utils/builtins.h"
#include "utils/fmgroids.h"
+#include "utils/lsyscache.h"	/* for get_rel_relkind */
#include "utils/memutils.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
@@ -172,6 +173,8 @@ find_all_inheritors(Oid parentrelId, LOCKMODE lockmode, List **numparents)
*rel_numparents;
ListCell *l;
+ Assert(get_rel_relkind(parentrelId) != RELKIND_PARTITIONED_TABLE);
+
memset(&ctl, 0, sizeof(ctl));
ctl.keysize = sizeof(Oid);
ctl.entrysize = sizeof(SeenRelsEntry);
@@ -267,6 +270,9 @@ has_subclass(Oid relationId)
if (!HeapTupleIsValid(tuple))
elog(ERROR, "cache lookup failed for relation %u", relationId);
+ Assert(!OidIsValid(((Form_pg_class) GETSTRUCT(tuple))->relpartitionparent));
+ Assert(((Form_pg_class) GETSTRUCT(tuple))->relkind != RELKIND_PARTITIONED_TABLE);
+
result = ((Form_pg_class) GETSTRUCT(tuple))->relhassubclass;
ReleaseSysCache(tuple);
return result;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 3e148f03d0..83f70fb165 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -26,6 +26,7 @@
#include "catalog/catalog.h"
#include "catalog/index.h"
#include "catalog/indexing.h"
+#include "catalog/partition.h"
#include "catalog/pg_collation.h"
#include "catalog/pg_inherits.h"
#include "catalog/pg_namespace.h"
@@ -311,9 +312,10 @@ analyze_rel(Oid relid, RangeVar *relation, int options,
relpages, false, in_outer_xact, elevel);
/*
- * If there are child tables, do recursive ANALYZE.
+ * If there are child tables or it's a partitioned table, do recursive
+ * ANALYZE.
*/
- if (onerel->rd_rel->relhassubclass)
+ if (onerel->rd_rel->relhassubclass || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
do_analyze_rel(onerel, options, params, va_cols, acquirefunc, relpages,
true, in_outer_xact, elevel);
@@ -1334,8 +1336,10 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
* Find all members of inheritance set. We only need AccessShareLock on
* the children.
*/
- tableOIDs =
- find_all_inheritors(RelationGetRelid(onerel), AccessShareLock, NULL);
+ if (onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+ tableOIDs = get_partition_descendants(RelationGetRelid(onerel), AccessShareLock);
+ else
+ tableOIDs = find_all_inheritors(RelationGetRelid(onerel), AccessShareLock, NULL);
/*
* Check that there's at least one descendant, else fail. This could
diff --git a/src/backend/commands/lockcmds.c b/src/backend/commands/lockcmds.c
index 71278b38cf..4a583c1c8b 100644
--- a/src/backend/commands/lockcmds.c
+++ b/src/backend/commands/lockcmds.c
@@ -15,6 +15,7 @@
#include "postgres.h"
#include "catalog/namespace.h"
+#include "catalog/partition.h"
#include "catalog/pg_inherits.h"
#include "commands/lockcmds.h"
#include "miscadmin.h"
@@ -29,6 +30,8 @@
static void LockTableRecurse(Oid reloid, LOCKMODE lockmode, bool nowait, Oid userid);
static AclResult LockTableAclCheck(Oid relid, LOCKMODE lockmode, Oid userid);
+static bool LockSingleTable(Oid relid, LOCKMODE lockmode, bool nowait,
+ Oid userid);
static void RangeVarCallbackForLockTable(const RangeVar *rv, Oid relid,
Oid oldrelid, void *arg);
static void LockViewRecurse(Oid reloid, LOCKMODE lockmode, bool nowait, List *ancestor_views);
@@ -107,65 +110,109 @@ RangeVarCallbackForLockTable(const RangeVar *rv, Oid relid, Oid oldrelid,
}
/*
- * Apply LOCK TABLE recursively over an inheritance tree
+ * LockSingleTable
+ * Apply LOCK TABLE to a single table.
+ */
+static bool
+LockSingleTable(Oid relid, LOCKMODE lockmode, bool nowait, Oid userid)
+{
+ AclResult aclresult;
+
+ /* Check permissions before acquiring the lock. */
+ aclresult = LockTableAclCheck(relid, lockmode, userid);
+ if (aclresult != ACLCHECK_OK)
+ {
+ char *relname = get_rel_name(relid);
+
+ if (!relname)
+ return false; /* table concurrently dropped, just skip it */
+ aclcheck_error(aclresult, get_relkind_objtype(get_rel_relkind(relid)), relname);
+ }
+
+ /* We have enough rights to lock the relation; do so. */
+ if (!nowait)
+ LockRelationOid(relid, lockmode);
+ else if (!ConditionalLockRelationOid(relid, lockmode))
+ {
+ /* try to throw error by name; relation could be deleted... */
+ char *relname = get_rel_name(relid);
+
+ if (!relname)
+ return false; /* table concurrently dropped, just skip it */
+ ereport(ERROR,
+ (errcode(ERRCODE_LOCK_NOT_AVAILABLE),
+ errmsg("could not obtain lock on relation \"%s\"",
+ relname)));
+ }
+
+ /*
+	 * Even if we got the lock, the table might have been concurrently
+	 * dropped.  If so, we can skip it.
+ */
+ if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
+ {
+ /* Release useless lock */
+ UnlockRelationOid(relid, lockmode);
+ return false;
+ }
+ return true;
+}
+
+/*
+ * Apply LOCK TABLE recursively over an inheritance tree or partitioned table
*
* We use find_inheritance_children not find_all_inheritors to avoid taking
* locks far in advance of checking privileges. This means we'll visit
- * multiply-inheriting children more than once, but that's no problem.
+ * multiply-inheriting children more than once, but that's no problem. For
+ * partitions, we simply loop over each partition, checking whether it's a
+ * sub-partitioned table or a leaf partition; for the latter we needn't
+ * recurse, but for the former we must.
*/
static void
LockTableRecurse(Oid reloid, LOCKMODE lockmode, bool nowait, Oid userid)
{
- List *children;
- ListCell *lc;
+ Relation rel;
+
+ if (!LockSingleTable(reloid, lockmode, nowait, userid))
+ return;
- children = find_inheritance_children(reloid, NoLock);
+ /* Lock taken above */
+ rel = relation_open(reloid, NoLock);
- foreach(lc, children)
+ if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- Oid childreloid = lfirst_oid(lc);
- AclResult aclresult;
+ PartitionDesc partdesc = RelationGetPartitionDesc(rel);
+ int nparts = partdesc->nparts;
+ int i;
- /* Check permissions before acquiring the lock. */
- aclresult = LockTableAclCheck(childreloid, lockmode, userid);
- if (aclresult != ACLCHECK_OK)
+ for (i = 0; i < nparts; i++)
{
- char *relname = get_rel_name(childreloid);
+ Oid partoid = partdesc->oids[i];
- if (!relname)
- continue; /* child concurrently dropped, just skip it */
- aclcheck_error(aclresult, get_relkind_objtype(get_rel_relkind(childreloid)), relname);
+ if (partdesc->is_leaf[i])
+ LockSingleTable(partoid, lockmode, nowait, userid);
+ else
+ LockTableRecurse(partoid, lockmode, nowait, userid);
}
+ }
- /* We have enough rights to lock the relation; do so. */
- if (!nowait)
- LockRelationOid(childreloid, lockmode);
- else if (!ConditionalLockRelationOid(childreloid, lockmode))
- {
- /* try to throw error by name; relation could be deleted... */
- char *relname = get_rel_name(childreloid);
-
- if (!relname)
- continue; /* child concurrently dropped, just skip it */
- ereport(ERROR,
- (errcode(ERRCODE_LOCK_NOT_AVAILABLE),
- errmsg("could not obtain lock on relation \"%s\"",
- relname)));
- }
+ /* leaf partitions won't have inheritance children, so skip those */
+ else if (!OidIsValid(rel->rd_rel->relpartitionparent))
+ {
+ List *children;
+ ListCell *lc;
- /*
- * Even if we got the lock, child might have been concurrently
- * dropped. If so, we can skip it.
- */
- if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(childreloid)))
+ children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+
+ foreach(lc, children)
{
- /* Release useless lock */
- UnlockRelationOid(childreloid, lockmode);
- continue;
- }
+ Oid childreloid = lfirst_oid(lc);
- LockTableRecurse(childreloid, lockmode, nowait, userid);
+ LockTableRecurse(childreloid, lockmode, nowait, userid);
+ }
}
+
+ relation_close(rel, NoLock);
}
/*
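The leaf-aware descent that both get_partition_descendants() and the reworked LockTableRecurse() above rely on can be sketched in a few lines. This is a toy model only: the `ToyPartDesc` struct, the `desc_of[]` array standing in for RelationGetPartitionDesc(), and `toy_collect_descendants` are all invented names, and locking is omitted. It just shows the shape of the recursion, where leaves are appended directly and only sub-partitioned children are descended into:

```c
#include <assert.h>

typedef unsigned int Oid;

/* Toy stand-in for PartitionDesc: child OIDs plus a per-child leaf flag */
typedef struct ToyPartDesc
{
	int			nparts;
	Oid			oids[4];
	int			is_leaf[4];
} ToyPartDesc;

/* desc_of[relid] stands in for RelationGetPartitionDesc(rel) */
static ToyPartDesc *desc_of[16];

/*
 * Append 'relid' and all of its partition descendants to 'out', starting
 * at index 'n'; returns the new count.  Leaves need no descriptor of
 * their own, so they are collected without recursing.
 */
static int
toy_collect_descendants(Oid relid, Oid *out, int n)
{
	ToyPartDesc *desc = desc_of[relid];
	int			i;

	out[n++] = relid;			/* include the table itself */

	for (i = 0; i < desc->nparts; i++)
	{
		if (desc->is_leaf[i])
			out[n++] = desc->oids[i];	/* leaf: collect directly */
		else
			n = toy_collect_descendants(desc->oids[i], out, n);
	}
	return n;
}
```

The per-child is_leaf flag is what lets callers avoid opening every descendant just to find out whether it has children of its own, which is the efficiency argument for storing the flag in the partition descriptor.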
diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index 6f7762a906..4622173e8a 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -28,6 +28,7 @@
#include "catalog/namespace.h"
#include "catalog/objectaccess.h"
#include "catalog/objectaddress.h"
+#include "catalog/partition.h"
#include "catalog/pg_inherits.h"
#include "catalog/pg_type.h"
#include "catalog/pg_publication.h"
@@ -529,8 +530,13 @@ OpenTableList(List *tables)
ListCell *child;
List *children;
- children = find_all_inheritors(myrelid, ShareUpdateExclusiveLock,
- NULL);
+ if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+ children = get_partition_descendants(myrelid,
+ ShareUpdateExclusiveLock);
+ else
+ children = find_all_inheritors(myrelid,
+ ShareUpdateExclusiveLock,
+ NULL);
foreach(child, children)
{
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index eb2d33dd86..8a0fcd7ece 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -40,6 +40,7 @@
#include "catalog/pg_inherits.h"
#include "catalog/pg_namespace.h"
#include "catalog/pg_opclass.h"
+#include "catalog/pg_partition.h"
#include "catalog/pg_tablespace.h"
#include "catalog/pg_trigger.h"
#include "catalog/pg_type.h"
@@ -477,6 +478,8 @@ static void RangeVarCallbackForAlterRelation(const RangeVar *rv, Oid relid,
static PartitionSpec *transformPartitionSpec(Relation rel, PartitionSpec *partspec, char *strategy);
static void ComputePartitionAttrs(Relation rel, List *partParams, AttrNumber *partattrs,
List **partexprs, Oid *partopclass, Oid *partcollation, char strategy);
+static void AttachPartition(Relation attachrel, Relation rel,
+ PartitionBoundSpec *bound);
static void CreateInheritance(Relation child_rel, Relation parent_rel);
static void RemoveInheritance(Relation child_rel, Relation parent_rel);
static ObjectAddress ATExecAttachPartition(List **wqueue, Relation rel,
@@ -492,8 +495,8 @@ static ObjectAddress ATExecAttachPartitionIdx(List **wqueue, Relation rel,
static void validatePartitionedIndex(Relation partedIdx, Relation partedTbl);
static void refuseDupeIndexAttach(Relation parentIdx, Relation partIdx,
Relation partitionTbl);
-static void update_relispartition(Relation classRel, Oid relationId,
- bool newval);
+static void update_relpartitionparent(Relation classRel, Oid relationId,
+ Oid newparent);
/* ----------------------------------------------------------------
@@ -771,7 +774,8 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
typaddress);
/* Store inheritance information for new rel. */
- StoreCatalogInheritance(relationId, inheritOids, stmt->partbound != NULL);
+ if (stmt->partbound == NULL)
+ StoreCatalogInheritance(relationId, inheritOids, false);
/*
* We must bump the command counter to make the newly-created relation
@@ -860,8 +864,11 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
heap_close(defaultRel, NoLock);
}
+ /* Add the pg_partition record */
+ AttachPartition(rel, parent, bound);
+
/* Update the pg_class entry. */
- StorePartitionBound(rel, parent, bound);
+ MarkRelationPartitioned(rel, parent, bound->is_default);
heap_close(parent, NoLock);
}
@@ -1204,7 +1211,7 @@ RangeVarCallbackForDropRelation(const RangeVar *rel, Oid relOid, Oid oldRelOid,
struct DropRelationCallbackState *state;
char relkind;
char expected_relkind;
- bool is_partition;
+ Oid parentoid;
Form_pg_class classform;
LOCKMODE heap_lockmode;
@@ -1243,7 +1250,7 @@ RangeVarCallbackForDropRelation(const RangeVar *rel, Oid relOid, Oid oldRelOid,
if (!HeapTupleIsValid(tuple))
return; /* concurrently dropped, so nothing to do */
classform = (Form_pg_class) GETSTRUCT(tuple);
- is_partition = classform->relispartition;
+ parentoid = classform->relpartitionparent;
/*
* Both RELKIND_RELATION and RELKIND_PARTITIONED_TABLE are OBJECT_TABLE,
@@ -1298,11 +1305,10 @@ RangeVarCallbackForDropRelation(const RangeVar *rel, Oid relOid, Oid oldRelOid,
* parent before its partitions, so we risk deadlock if we do it the other
* way around.
*/
- if (is_partition && relOid != oldRelOid)
+ if (OidIsValid(parentoid) && relOid != oldRelOid)
{
- state->partParentOid = get_partition_parent(relOid);
- if (OidIsValid(state->partParentOid))
- LockRelationOid(state->partParentOid, AccessExclusiveLock);
+ state->partParentOid = parentoid;
+ LockRelationOid(parentoid, AccessExclusiveLock);
}
}
@@ -1356,7 +1362,11 @@ ExecuteTruncate(TruncateStmt *stmt)
ListCell *child;
List *children;
- children = find_all_inheritors(myrelid, AccessExclusiveLock, NULL);
+ /* partitioned tables cannot have any inheritors */
+ if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+ children = get_partition_descendants(myrelid, AccessExclusiveLock);
+ else
+ children = find_all_inheritors(myrelid, AccessExclusiveLock, NULL);
foreach(child, children)
{
@@ -1972,7 +1982,7 @@ MergeAttributes(List *schema, List *supers, char relpersistence,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
errmsg("cannot inherit from partitioned table \"%s\"",
parent->relname)));
- if (relation->rd_rel->relispartition && !is_partition)
+ if (OidIsValid(relation->rd_rel->relpartitionparent) && !is_partition)
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
errmsg("cannot inherit from partition \"%s\"",
@@ -2761,28 +2771,45 @@ renameatt_internal(Oid myrelid,
ListCell *lo,
*li;
- /*
- * we need the number of parents for each child so that the recursive
- * calls to renameatt() can determine whether there are any parents
- * outside the inheritance hierarchy being processed.
- */
- child_oids = find_all_inheritors(myrelid, AccessExclusiveLock,
- &child_numparents);
+ if (targetrelation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+ {
+ child_oids = get_partition_descendants(myrelid, AccessExclusiveLock);
- /*
- * find_all_inheritors does the recursive search of the inheritance
- * hierarchy, so all we have to do is process all of the relids in the
- * list that it returns.
- */
- forboth(lo, child_oids, li, child_numparents)
+ foreach(lo, child_oids)
+ {
+ Oid childrelid = lfirst_oid(lo);
+
+ if (childrelid == myrelid)
+ continue;
+ /* note we need not recurse again */
+ renameatt_internal(childrelid, oldattname, newattname, false, true, 1, behavior);
+ }
+ }
+ else if (!OidIsValid(targetrelation->rd_rel->relpartitionparent))
{
- Oid childrelid = lfirst_oid(lo);
- int numparents = lfirst_int(li);
+ /*
+ * we need the number of parents for each child so that the recursive
+ * calls to renameatt() can determine whether there are any parents
+ * outside the inheritance hierarchy being processed.
+ */
+ child_oids = find_all_inheritors(myrelid, AccessExclusiveLock,
+ &child_numparents);
- if (childrelid == myrelid)
- continue;
- /* note we need not recurse again */
- renameatt_internal(childrelid, oldattname, newattname, false, true, numparents, behavior);
+ /*
+ * find_all_inheritors does the recursive search of the inheritance
+ * hierarchy, so all we have to do is process all of the relids in the
+ * list that it returns.
+ */
+ forboth(lo, child_oids, li, child_numparents)
+ {
+ Oid childrelid = lfirst_oid(lo);
+ int numparents = lfirst_int(li);
+
+ if (childrelid == myrelid)
+ continue;
+ /* note we need not recurse again */
+ renameatt_internal(childrelid, oldattname, newattname, false, true, numparents, behavior);
+ }
}
}
else
@@ -5043,12 +5070,17 @@ ATSimpleRecursion(List **wqueue, Relation rel,
ListCell *child;
List *children;
- children = find_all_inheritors(relid, lockmode, NULL);
+ if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+ children = get_partition_descendants(relid, lockmode);
+ else if (!OidIsValid(rel->rd_rel->relpartitionparent))
+ children = find_all_inheritors(relid, lockmode, NULL);
+ else
+ children = NIL;
/*
- * find_all_inheritors does the recursive search of the inheritance
- * hierarchy, so all we have to do is process all of the relids in the
- * list that it returns.
+	 * find_all_inheritors and get_partition_descendants perform the
+ * recursive search for all descendant tables, so all we have to do is
+ * process all of the relids in the list that it returns.
*/
foreach(child, children)
{
@@ -5057,7 +5089,7 @@ ATSimpleRecursion(List **wqueue, Relation rel,
if (childrelid == relid)
continue;
- /* find_all_inheritors already got lock */
+ /* lock already obtained above */
childrel = relation_open(childrelid, NoLock);
CheckTableNotInUse(childrel, "ALTER TABLE");
ATPrepCmd(wqueue, childrel, cmd, false, true, lockmode);
@@ -5374,7 +5406,7 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel,
if (recursing)
ATSimplePermissions(rel, ATT_TABLE | ATT_FOREIGN_TABLE);
- if (rel->rd_rel->relispartition && !recursing)
+ if (OidIsValid(rel->rd_rel->relpartitionparent) && !recursing)
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
errmsg("cannot add column to a partition")));
@@ -5697,7 +5729,18 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel,
* routines, we have to do this one level of recursion at a time; we can't
* use find_all_inheritors to do it in one pass.
*/
- children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+ if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+ {
+ PartitionDesc partdesc = RelationGetPartitionDesc(rel);
+ int i;
+ children = NIL;
+ for (i = 0; i < partdesc->nparts; i++)
+ children = lappend_oid(children, partdesc->oids[i]);
+ }
+ else if (OidIsValid(rel->rd_rel->relpartitionparent))
+ children = NIL;
+ else
+ children = find_inheritance_children(RelationGetRelid(rel), lockmode);
/*
* If we are told not to recurse, there had better not be any child
@@ -5971,10 +6014,13 @@ ATExecDropNotNull(Relation rel, const char *colName, LOCKMODE lockmode)
list_free(indexoidlist);
- /* If rel is partition, shouldn't drop NOT NULL if parent has the same */
- if (rel->rd_rel->relispartition)
+ /*
+	 * Disallow dropping a partition's NOT NULL constraint when the
+ * constraint is present on the parent.
+ */
+ if (OidIsValid(RelationGetParentRelid(rel)))
{
- Oid parentId = get_partition_parent(RelationGetRelid(rel));
+ Oid parentId = RelationGetParentRelid(rel);
Relation parent = heap_open(parentId, AccessShareLock);
TupleDesc tupDesc = RelationGetDescr(parent);
AttrNumber parent_attnum;
@@ -6800,7 +6846,21 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
* routines, we have to do this one level of recursion at a time; we can't
* use find_all_inheritors to do it in one pass.
*/
- children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+ if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+ {
+ PartitionDesc partdesc = RelationGetPartitionDesc(rel);
+ int i;
+ children = NIL;
+ for (i = 0; i < partdesc->nparts; i++)
+ {
+ LockRelationOid(partdesc->oids[i], lockmode);
+			children = lappend_oid(children, partdesc->oids[i]);
+ }
+ }
+ else if (!OidIsValid(rel->rd_rel->relpartitionparent))
+ children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+ else
+ children = NIL;
if (children)
{
@@ -7251,7 +7311,21 @@ ATAddCheckConstraint(List **wqueue, AlteredTableInfo *tab, Relation rel,
* routines, we have to do this one level of recursion at a time; we can't
* use find_all_inheritors to do it in one pass.
*/
- children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+ if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+ {
+ PartitionDesc partdesc = RelationGetPartitionDesc(rel);
+ int i;
+ children = NIL;
+ for (i = 0; i < partdesc->nparts; i++)
+ {
+ LockRelationOid(partdesc->oids[i], lockmode);
+ children = lappend_oid(children, partdesc->oids[i]);
+ }
+ }
+ else if (!OidIsValid(rel->rd_rel->relpartitionparent))
+ children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+ else
+ children = NIL;
/*
* Check if ONLY was specified with ALTER TABLE. If so, allow the
@@ -8025,8 +8099,14 @@ ATExecValidateConstraint(Relation rel, char *constrName, bool recurse,
* shouldn't try to look for it in the children.
*/
if (!recursing && !con->connoinherit)
- children = find_all_inheritors(RelationGetRelid(rel),
+ {
+ if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+ children = get_partition_descendants(RelationGetRelid(rel),
+ lockmode);
+ else if (!OidIsValid(rel->rd_rel->relpartitionparent))
+ children = find_all_inheritors(RelationGetRelid(rel),
lockmode, NULL);
+ }
/*
* For CHECK constraints, we must ensure that we only mark the
@@ -8939,7 +9019,23 @@ ATExecDropConstraint(Relation rel, const char *constrName,
* use find_all_inheritors to do it in one pass.
*/
if (!is_no_inherit_constraint)
- children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+ {
+ if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+ {
+ PartitionDesc partdesc = RelationGetPartitionDesc(rel);
+ int i;
+ children = NIL;
+ for (i = 0; i < partdesc->nparts; i++)
+ {
+ LockRelationOid(partdesc->oids[i], lockmode);
+ children = lappend_oid(children, partdesc->oids[i]);
+ }
+ }
+ else if (!OidIsValid(rel->rd_rel->relpartitionparent))
+ children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+ else
+ children = NIL;
+ }
else
children = NIL;
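
The lock-and-collect loop above now appears verbatim in ATExecDropColumn,
ATAddCheckConstraint and ATExecDropConstraint. A small helper could cut the
duplication; a sketch (not part of the patch, helper name invented) might
look like:

```c
/* Sketch of a possible helper, not in the patch: lock each partition
 * at the requested level and return their OIDs as a list. */
static List *
lock_and_get_partition_oids(Relation rel, LOCKMODE lockmode)
{
	PartitionDesc partdesc = RelationGetPartitionDesc(rel);
	List	   *children = NIL;
	int			i;

	for (i = 0; i < partdesc->nparts; i++)
	{
		LockRelationOid(partdesc->oids[i], lockmode);
		children = lappend_oid(children, partdesc->oids[i]);
	}
	return children;
}
```

Note that it builds an OID list with lappend_oid(), which is what the
callers' later list_member_oid()/lfirst_oid() usage expects.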
@@ -9230,12 +9326,15 @@ ATPrepAlterColumnType(List **wqueue,
ListCell *child;
List *children;
- children = find_all_inheritors(relid, lockmode, NULL);
+ if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+ children = get_partition_descendants(relid, lockmode);
+ else
+ children = find_all_inheritors(relid, lockmode, NULL);
/*
- * find_all_inheritors does the recursive search of the inheritance
- * hierarchy, so all we have to do is process all of the relids in the
- * list that it returns.
+ * find_all_inheritors and get_partition_descendants do the
+ * recursive search of all descendant tables, so all we have to do is
+ * process all of the relids in the list that they return.
*/
foreach(child, children)
{
@@ -9245,7 +9344,7 @@ ATPrepAlterColumnType(List **wqueue,
if (childrelid == relid)
continue;
- /* find_all_inheritors already got lock */
+ /* lock already obtained above */
childrel = relation_open(childrelid, NoLock);
CheckTableNotInUse(childrel, "ALTER TABLE");
@@ -11379,7 +11478,7 @@ ATPrepAddInherit(Relation child_rel)
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
errmsg("cannot change inheritance of typed table")));
- if (child_rel->rd_rel->relispartition)
+ if (OidIsValid(child_rel->rd_rel->relpartitionparent))
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
errmsg("cannot change inheritance of a partition")));
@@ -11443,7 +11542,7 @@ ATExecAddInherit(Relation child_rel, RangeVar *parent, LOCKMODE lockmode)
parent->relname)));
/* Likewise for partitions */
- if (parent_rel->rd_rel->relispartition)
+ if (OidIsValid(parent_rel->rd_rel->relpartitionparent))
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
errmsg("cannot inherit from a partition")));
@@ -11506,6 +11605,59 @@ ATExecAddInherit(Relation child_rel, RangeVar *parent, LOCKMODE lockmode)
return address;
}
+static void
+AttachPartition(Relation attachrel, Relation rel, PartitionBoundSpec *bound)
+{
+ Datum values[Natts_pg_partition];
+ bool nulls[Natts_pg_partition];
+ HeapTuple tuple;
+ Relation partRelation;
+ Oid attachrelid = RelationGetRelid(attachrel);
+ Oid partedrelid = RelationGetRelid(rel);
+ ObjectAddress childobject,
+ parentobject;
+
+ partRelation = heap_open(PartitionRelationId, RowExclusiveLock);
+
+ /*
+ * Make the pg_partition entry
+ */
+ values[Anum_pg_partition_partrelid - 1] = ObjectIdGetDatum(attachrelid);
+ values[Anum_pg_partition_parentrelid - 1] = ObjectIdGetDatum(partedrelid);
+ values[Anum_pg_partition_partbound - 1] = CStringGetTextDatum(nodeToString(bound));
+
+ memset(nulls, 0, sizeof(nulls));
+
+ tuple = heap_form_tuple(RelationGetDescr(partRelation), values, nulls);
+
+ CatalogTupleInsert(partRelation, tuple);
+
+ heap_freetuple(tuple);
+
+ heap_close(partRelation, RowExclusiveLock);
+
+ /*
+ * Store a dependency too
+ */
+ parentobject.classId = RelationRelationId;
+ parentobject.objectId = partedrelid;
+ parentobject.objectSubId = 0;
+ childobject.classId = RelationRelationId;
+ childobject.objectId = attachrelid;
+ childobject.objectSubId = 0;
+
+ recordDependencyOn(&childobject, &parentobject, DEPENDENCY_AUTO);
+
+ /*
+ * Post alter hook of this partition. Since object_access_hook
+ * doesn't take multiple object identifiers, we relay oid of parent
+ * relation using auxiliary_id argument.
+ */
+ InvokeObjectPostAlterHookArg(PartitionRelationId,
+ attachrelid, 0,
+ partedrelid, false);
+}
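
For readers without the 0002 patch handy, the pg_partition catalog that
AttachPartition() populates is presumably declared roughly as follows.
This is reconstructed from the Anum_pg_partition_* and Form_pg_partition
references in this excerpt; the catalog OID and declaration details are
guesses:

```c
/* Presumed layout of the new catalog; the 0000 OID is a placeholder. */
CATALOG(pg_partition,0000,PartitionRelationId)
{
	Oid			partrelid;		/* partition's own OID */
	Oid			parentrelid;	/* OID of parent partitioned table */

#ifdef CATALOG_VARLEN			/* variable-length fields start here */
	text		partbound;		/* partition bound, nodeToString() output */
#endif
} FormData_pg_partition;
```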
+
/*
* CreateInheritance
* Catalog manipulation portion of creating inheritance between a child
@@ -11922,7 +12074,7 @@ ATExecDropInherit(Relation rel, RangeVar *parent, LOCKMODE lockmode)
ObjectAddress address;
Relation parent_rel;
- if (rel->rd_rel->relispartition)
+ if (OidIsValid(rel->rd_rel->relpartitionparent))
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
errmsg("cannot change inheritance of a partition")));
@@ -11967,7 +12119,7 @@ ATExecDropInherit(Relation rel, RangeVar *parent, LOCKMODE lockmode)
* coninhcount and conislocal for inherited constraints are adjusted in
* exactly the same way.
*
- * Common to ATExecDropInherit() and ATExecDetachPartition().
+ * Used in ATExecDropInherit()
*/
static void
RemoveInheritance(Relation child_rel, Relation parent_rel)
@@ -11979,28 +12131,16 @@ RemoveInheritance(Relation child_rel, Relation parent_rel)
constraintTuple;
List *connames;
bool found;
- bool child_is_partition = false;
-
- /* If parent_rel is a partitioned table, child_rel must be a partition */
- if (parent_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
- child_is_partition = true;
found = DeleteInheritsTuple(RelationGetRelid(child_rel),
RelationGetRelid(parent_rel));
if (!found)
{
- if (child_is_partition)
- ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_TABLE),
- errmsg("relation \"%s\" is not a partition of relation \"%s\"",
- RelationGetRelationName(child_rel),
- RelationGetRelationName(parent_rel))));
- else
- ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_TABLE),
- errmsg("relation \"%s\" is not a parent of relation \"%s\"",
- RelationGetRelationName(parent_rel),
- RelationGetRelationName(child_rel))));
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_TABLE),
+ errmsg("relation \"%s\" is not a parent of relation \"%s\"",
+ RelationGetRelationName(parent_rel),
+ RelationGetRelationName(child_rel))));
}
/*
@@ -12119,7 +12259,7 @@ RemoveInheritance(Relation child_rel, Relation parent_rel)
drop_parent_dependency(RelationGetRelid(child_rel),
RelationRelationId,
RelationGetRelid(parent_rel),
- child_dependency_type(child_is_partition));
+ child_dependency_type(false));
/*
* Post alter hook of this inherits. Since object_access_hook doesn't take
@@ -14039,7 +14179,6 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
{
Relation attachrel,
catalog;
- List *attachrel_children;
List *partConstraint;
SysScanDesc scan;
ScanKeyData skey;
@@ -14072,7 +14211,7 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
ATSimplePermissions(attachrel, ATT_TABLE | ATT_FOREIGN_TABLE);
/* A partition can only have one parent */
- if (attachrel->rd_rel->relispartition)
+ if (OidIsValid(attachrel->rd_rel->relpartitionparent))
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
errmsg("\"%s\" is already a partition",
@@ -14084,8 +14223,8 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
errmsg("cannot attach a typed table as partition")));
/*
- * Table being attached should not already be part of inheritance; either
- * as a child table...
+ * The table being attached should not be part of any inheritance
+ * hierarchy as a child or as a parent.
*/
catalog = heap_open(InheritsRelationId, AccessShareLock);
ScanKeyInit(&skey,
@@ -14100,15 +14239,13 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
errmsg("cannot attach inheritance child as partition")));
systable_endscan(scan);
- /* ...or as a parent table (except the case when it is partitioned) */
ScanKeyInit(&skey,
Anum_pg_inherits_inhparent,
BTEqualStrategyNumber, F_OIDEQ,
ObjectIdGetDatum(RelationGetRelid(attachrel)));
scan = systable_beginscan(catalog, InheritsParentIndexId, true, NULL,
1, &skey);
- if (HeapTupleIsValid(systable_getnext(scan)) &&
- attachrel->rd_rel->relkind == RELKIND_RELATION)
+ if (HeapTupleIsValid(systable_getnext(scan)))
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
errmsg("cannot attach inheritance parent as partition")));
@@ -14129,16 +14266,24 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
* definition is present in all the partitions, we need not scan the
* table, nor its partitions. But we cannot risk a deadlock by taking a
* weaker lock now and the stronger one only when needed.
+ *
+ * If the attachrel is a leaf partition, then it can have no partitions
+ * of its own, so we needn't bother checking this.
*/
- attachrel_children = find_all_inheritors(RelationGetRelid(attachrel),
- AccessExclusiveLock, NULL);
- if (list_member_oid(attachrel_children, RelationGetRelid(rel)))
- ereport(ERROR,
- (errcode(ERRCODE_DUPLICATE_TABLE),
- errmsg("circular inheritance not allowed"),
- errdetail("\"%s\" is already a child of \"%s\".",
- RelationGetRelationName(rel),
- RelationGetRelationName(attachrel))));
+ if (attachrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+ {
+ List *attachrel_children;
+
+ attachrel_children = get_partition_descendants(RelationGetRelid(attachrel),
+ AccessExclusiveLock);
+ if (list_member_oid(attachrel_children, RelationGetRelid(rel)))
+ ereport(ERROR,
+ (errcode(ERRCODE_DUPLICATE_TABLE),
+ errmsg("circular partitioning is not allowed"),
+ errdetail("\"%s\" is already a child of \"%s\".",
+ RelationGetRelationName(rel),
+ RelationGetRelationName(attachrel))));
+ }
/* If the parent is permanent, so must be all of its partitions. */
if (rel->rd_rel->relpersistence != RELPERSISTENCE_TEMP &&
@@ -14223,8 +14368,11 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
trigger_name, RelationGetRelationName(attachrel)),
errdetail("ROW triggers with transition tables are not supported on partitions")));
- /* OK to create inheritance. Rest of the checks performed there */
- CreateInheritance(attachrel, rel);
+ /* Match up the columns and bump attinhcount as needed */
+ MergeAttributesIntoExisting(attachrel, rel);
+
+ /* Match up the constraints and bump coninhcount as needed */
+ MergeConstraintsIntoExisting(attachrel, rel);
/*
* Check that the new partition's bound is valid and does not overlap any
@@ -14234,8 +14382,10 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
check_new_partition_bound(RelationGetRelationName(attachrel), rel,
cmd->bound);
+ AttachPartition(attachrel, rel, cmd->bound);
+
/* Update the pg_class entry. */
- StorePartitionBound(attachrel, rel, cmd->bound);
+ MarkRelationPartitioned(attachrel, rel, cmd->bound->is_default);
/* Ensure there exists a correct set of indexes in the partition. */
AttachPartitionEnsureIndexes(rel, attachrel);
@@ -14427,7 +14577,7 @@ AttachPartitionEnsureIndexes(Relation rel, Relation attachrel)
Oid cldConstrOid = InvalidOid;
/* does this index have a parent? if so, can't use it */
- if (attachrelIdxRels[i]->rd_rel->relispartition)
+ if (OidIsValid(attachrelIdxRels[i]->rd_rel->relpartitionparent))
continue;
if (CompareIndexInfo(attachInfos[i], info,
@@ -14458,7 +14608,7 @@ AttachPartitionEnsureIndexes(Relation rel, Relation attachrel)
IndexSetParentIndex(attachrelIdxRels[i], idx);
if (OidIsValid(constraintOid))
ConstraintSetParentConstraint(cldConstrOid, constraintOid);
- update_relispartition(NULL, cldIdxId, true);
+ update_relpartitionparent(NULL, cldIdxId, idx);
found = true;
break;
}
@@ -14628,18 +14778,20 @@ static ObjectAddress
ATExecDetachPartition(Relation rel, RangeVar *name)
{
Relation partRel,
- classRel;
- HeapTuple tuple,
- newtuple;
- Datum new_val[Natts_pg_class];
- bool isnull,
- new_null[Natts_pg_class],
- new_repl[Natts_pg_class];
+ pgclass,
+ pgattr,
+ pgpart;
+ HeapTuple tuple;
+ HeapTuple parttup;
+ bool isnull;
+ SysScanDesc scan;
+ ScanKeyData key[3];
ObjectAddress address;
Oid defaultPartOid;
List *indexes;
ListCell *cell;
+
/*
* We must lock the default partition, because detaching this partition
* will change its partition constraint.
@@ -14651,35 +14803,85 @@ ATExecDetachPartition(Relation rel, RangeVar *name)
partRel = heap_openrv(name, AccessShareLock);
- /* All inheritance related checks are performed within the function */
- RemoveInheritance(partRel, rel);
-
/* Update pg_class tuple */
- classRel = heap_open(RelationRelationId, RowExclusiveLock);
+ pgclass = heap_open(RelationRelationId, RowExclusiveLock);
tuple = SearchSysCacheCopy1(RELOID,
ObjectIdGetDatum(RelationGetRelid(partRel)));
if (!HeapTupleIsValid(tuple))
elog(ERROR, "cache lookup failed for relation %u",
RelationGetRelid(partRel));
- Assert(((Form_pg_class) GETSTRUCT(tuple))->relispartition);
- (void) SysCacheGetAttr(RELOID, tuple, Anum_pg_class_relpartbound,
+ if (!OidIsValid(((Form_pg_class) GETSTRUCT(tuple))->relpartitionparent))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_TABLE),
+ errmsg("relation \"%s\" is not a partition of relation \"%s\"",
+ RelationGetRelationName(partRel),
+ RelationGetRelationName(rel))));
+
+ pgpart = heap_open(PartitionRelationId, RowExclusiveLock);
+
+ parttup = SearchSysCacheCopy1(PARTSRELID,
+ ObjectIdGetDatum(RelationGetRelid(partRel)));
+ if (!HeapTupleIsValid(parttup))
+ elog(ERROR, "cache lookup failed for relation %u",
+ RelationGetRelid(partRel));
+ /* XXX or use relpartitionparent to check? */
+ if (((Form_pg_partition) GETSTRUCT(parttup))->parentrelid != RelationGetRelid(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_TABLE),
+ errmsg("relation \"%s\" is not a partition of relation \"%s\"",
+ RelationGetRelationName(partRel),
+ RelationGetRelationName(rel))));
+
+ (void) SysCacheGetAttr(PARTSRELID, parttup, Anum_pg_partition_partbound,
&isnull);
Assert(!isnull);
- /* Clear relpartbound and reset relispartition */
- memset(new_val, 0, sizeof(new_val));
- memset(new_null, false, sizeof(new_null));
- memset(new_repl, false, sizeof(new_repl));
- new_val[Anum_pg_class_relpartbound - 1] = (Datum) 0;
- new_null[Anum_pg_class_relpartbound - 1] = true;
- new_repl[Anum_pg_class_relpartbound - 1] = true;
- newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
- new_val, new_null, new_repl);
-
- ((Form_pg_class) GETSTRUCT(newtuple))->relispartition = false;
- CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
- heap_freetuple(newtuple);
+ /* unset relpartitionparent */
+ ((Form_pg_class) GETSTRUCT(tuple))->relpartitionparent = InvalidOid;
+ CatalogTupleUpdate(pgclass, &tuple->t_self, tuple);
+ heap_freetuple(tuple);
+
+ CatalogTupleDelete(pgpart, &parttup->t_self);
+ heap_freetuple(parttup);
+
+ /*
+ * Search through the partition's columns for ones inherited from the parent
+ */
+ pgattr = heap_open(AttributeRelationId, RowExclusiveLock);
+ ScanKeyInit(&key[0],
+ Anum_pg_attribute_attrelid,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(RelationGetRelid(partRel)));
+ scan = systable_beginscan(pgattr, AttributeRelidNumIndexId,
+ true, NULL, 1, key);
+ while (HeapTupleIsValid(tuple = systable_getnext(scan)))
+ {
+ Form_pg_attribute att = (Form_pg_attribute) GETSTRUCT(tuple);
+
+ /* Ignore if dropped or not inherited */
+ if (att->attisdropped)
+ continue;
+ if (att->attinhcount <= 0)
+ continue;
+
+ if (SearchSysCacheExistsAttName(RelationGetRelid(rel),
+ NameStr(att->attname)))
+ {
+ /* Decrement inhcount and possibly set islocal to true */
+ HeapTuple copyTuple = heap_copytuple(tuple);
+ Form_pg_attribute copy_att = (Form_pg_attribute) GETSTRUCT(copyTuple);
+
+ copy_att->attinhcount--;
+ if (copy_att->attinhcount == 0)
+ copy_att->attislocal = true;
+
+ CatalogTupleUpdate(pgattr, &copyTuple->t_self, copyTuple);
+ heap_freetuple(copyTuple);
+ }
+ }
+ systable_endscan(scan);
+ heap_close(pgattr, RowExclusiveLock);
if (OidIsValid(defaultPartOid))
{
@@ -14712,10 +14914,11 @@ ATExecDetachPartition(Relation rel, RangeVar *name)
idx = index_open(idxid, AccessExclusiveLock);
IndexSetParentIndex(idx, InvalidOid);
- update_relispartition(classRel, idxid, false);
+ update_relpartitionparent(pgclass, idxid, InvalidOid);
relation_close(idx, AccessExclusiveLock);
}
- heap_close(classRel, RowExclusiveLock);
+ heap_close(pgclass, RowExclusiveLock);
+ heap_close(pgpart, RowExclusiveLock);
/*
* Invalidate the parent's relcache so that the partition is no longer
@@ -14728,6 +14931,20 @@ ATExecDetachPartition(Relation rel, RangeVar *name)
/* keep our lock until commit */
heap_close(partRel, NoLock);
+ drop_parent_dependency(RelationGetRelid(partRel),
+ RelationRelationId,
+ RelationGetRelid(rel),
+ child_dependency_type(true));
+
+ /*
+ * Post alter hook for the detached partition. Since object_access_hook
+ * doesn't take multiple object identifiers, we relay the OID of the
+ * parent relation using the auxiliary_id argument.
+ */
+ InvokeObjectPostAlterHookArg(PartitionRelationId,
+ RelationGetRelid(partRel), 0,
+ RelationGetRelid(rel), false);
+
return address;
}
@@ -14837,8 +15054,7 @@ ATExecAttachPartitionIdx(List **wqueue, Relation parentIdx, RangeVar *name)
ObjectAddressSet(address, RelationRelationId, RelationGetRelid(partIdx));
/* Silently do nothing if already in the right state */
- currParent = partIdx->rd_rel->relispartition ?
- get_partition_parent(partIdxId) : InvalidOid;
+ currParent = partIdx->rd_rel->relpartitionparent;
if (currParent != RelationGetRelid(parentIdx))
{
IndexInfo *childInfo;
@@ -14932,7 +15148,7 @@ ATExecAttachPartitionIdx(List **wqueue, Relation parentIdx, RangeVar *name)
IndexSetParentIndex(partIdx, RelationGetRelid(parentIdx));
if (OidIsValid(constraintOid))
ConstraintSetParentConstraint(cldConstrId, constraintOid);
- update_relispartition(NULL, partIdxId, true);
+ update_relpartitionparent(NULL, partIdxId, RelationGetRelid(parentIdx));
pfree(attmap);
@@ -15060,7 +15276,7 @@ validatePartitionedIndex(Relation partedIdx, Relation partedTbl)
* If this index is in turn a partition of a larger index, validating it
* might cause the parent to become valid also. Try that.
*/
- if (updated && partedIdx->rd_rel->relispartition)
+ if (updated && OidIsValid(RelationGetParentRelid(partedIdx)))
{
Oid parentIdxId,
parentTblId;
@@ -15070,8 +15286,8 @@ validatePartitionedIndex(Relation partedIdx, Relation partedTbl)
/* make sure we see the validation we just did */
CommandCounterIncrement();
- parentIdxId = get_partition_parent(RelationGetRelid(partedIdx));
- parentTblId = get_partition_parent(RelationGetRelid(partedTbl));
+ parentIdxId = RelationGetParentRelid(partedIdx);
+ parentTblId = RelationGetParentRelid(partedTbl);
parentIdx = relation_open(parentIdxId, AccessExclusiveLock);
parentTbl = relation_open(parentTblId, AccessExclusiveLock);
Assert(!parentIdx->rd_index->indisvalid);
@@ -15084,13 +15300,14 @@ validatePartitionedIndex(Relation partedIdx, Relation partedTbl)
}
/*
- * Update the relispartition flag of the given relation to the given value.
+ * Update the relpartitionparent field of the given relation to the given
+ * parent OID.
*
* classRel is the pg_class relation, already open and suitably locked.
* It can be passed as NULL, in which case it's opened and closed locally.
*/
static void
-update_relispartition(Relation classRel, Oid relationId, bool newval)
+update_relpartitionparent(Relation classRel, Oid relationId, Oid newparent)
{
HeapTuple tup;
HeapTuple newtup;
@@ -15106,7 +15323,7 @@ update_relispartition(Relation classRel, Oid relationId, bool newval)
tup = SearchSysCache1(RELOID, ObjectIdGetDatum(relationId));
newtup = heap_copytuple(tup);
classForm = (Form_pg_class) GETSTRUCT(newtup);
- classForm->relispartition = newval;
+ classForm->relpartitionparent = newparent;
CatalogTupleUpdate(classRel, &tup->t_self, newtup);
heap_freetuple(newtup);
ReleaseSysCache(tup);
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 2436692eb8..8c712b97cc 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -363,8 +363,8 @@ CreateTrigger(CreateTrigStmt *stmt, const char *queryString,
partition_recurse = !isInternal && stmt->row &&
rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE;
if (partition_recurse)
- list_free(find_all_inheritors(RelationGetRelid(rel),
- ShareRowExclusiveLock, NULL));
+ list_free(get_partition_descendants(RelationGetRelid(rel),
+ ShareRowExclusiveLock));
/* Compute tgtype */
TRIGGER_CLEAR_TYPE(tgtype);
@@ -453,14 +453,14 @@ CreateTrigger(CreateTrigStmt *stmt, const char *queryString,
* tuples each child should see. See also the prohibitions in
* ATExecAttachPartition() and ATExecAddInherit().
*/
- if (TRIGGER_FOR_ROW(tgtype) && has_superclass(rel->rd_id))
+ if (TRIGGER_FOR_ROW(tgtype))
{
/* Use appropriate error message. */
- if (rel->rd_rel->relispartition)
+ if (OidIsValid(rel->rd_rel->relpartitionparent))
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("ROW triggers with transition tables are not supported on partitions")));
- else
+ else if (has_superclass(rel->rd_id))
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("ROW triggers with transition tables are not supported on inheritance children")));
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ee32fe8871..6ec8cee33f 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -31,6 +31,7 @@
#include "access/transam.h"
#include "access/xact.h"
#include "catalog/namespace.h"
+#include "catalog/partition.h"
#include "catalog/pg_database.h"
#include "catalog/pg_inherits.h"
#include "catalog/pg_namespace.h"
@@ -485,7 +486,7 @@ expand_vacuum_rel(VacuumRelation *vrel)
*/
if (include_parts)
{
- List *part_oids = find_all_inheritors(relid, NoLock, NULL);
+ List *part_oids = get_partition_descendants(relid, NoLock);
ListCell *part_lc;
foreach(part_lc, part_oids)
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 7849e04bdb..448526d9f2 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -14,7 +14,6 @@
#include "postgres.h"
#include "catalog/partition.h"
-#include "catalog/pg_inherits.h"
#include "catalog/pg_type.h"
#include "executor/execPartition.h"
#include "executor/executor.h"
@@ -80,7 +79,7 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
/* Lock all the partitions. */
- (void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
+ (void) get_partition_descendants(RelationGetRelid(rel), RowExclusiveLock);
/*
* Here we attempt to expend as little effort as possible in setting up
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 690b6bbab7..3896617760 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -99,9 +99,11 @@ static List *generate_append_tlist(List *colTypes, List *colCollations,
List *input_tlists,
List *refnames_tlist);
static List *generate_setop_grouplist(SetOperationStmt *op, List *targetlist);
+static void expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *rte,
+ Index rti);
static void expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte,
Index rti);
-static void expand_partitioned_rtentry(PlannerInfo *root,
+static void expand_partitioned_rtentry_recurse(PlannerInfo *root,
RangeTblEntry *parentrte,
Index parentRTindex, Relation parentrel,
PlanRowMark *top_parentrc, LOCKMODE lockmode,
@@ -1487,7 +1489,20 @@ expand_inherited_tables(PlannerInfo *root)
{
RangeTblEntry *rte = (RangeTblEntry *) lfirst(rl);
- expand_inherited_rtentry(root, rte, rti);
+ if (rte->relkind == RELKIND_PARTITIONED_TABLE)
+ expand_partitioned_rtentry(root, rte, rti);
+ else if (OidIsValid(rte->relid))
+ {
+ /*
+ * A partition cannot be a plain inheritance parent, so there
+ * are no children to expand here.
+ */
+ if (OidIsValid(get_partition_parent(rte->relid)))
+ rte->inh = false;
+ else
+ expand_inherited_rtentry(root, rte, rti);
+ }
+ /* no need to attempt to expand anything else */
rl = lnext(rl);
}
}
@@ -1522,6 +1537,11 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
LOCKMODE lockmode;
List *inhOIDs;
ListCell *l;
+ List *appinfos = NIL;
+ RangeTblEntry *childrte;
+ Index childRTindex;
+
+ Assert(rte->relkind != RELKIND_PARTITIONED_TABLE);
/* Does RT entry allow inheritance? */
if (!rte->inh)
@@ -1534,7 +1554,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
}
/* Fast path for common case of childless table */
parentOID = rte->relid;
- if (!has_subclass(parentOID))
+ if (rte->relkind != RELKIND_PARTITIONED_TABLE && !has_subclass(parentOID))
{
/* Clear flag before returning */
rte->inh = false;
@@ -1591,93 +1611,137 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
*/
oldrelation = heap_open(parentOID, NoLock);
- /* Scan the inheritance set and expand it */
- if (RelationGetPartitionDesc(oldrelation) != NULL)
+ Assert(!RelationGetPartitionDesc(oldrelation));
+
+ /*
+ * This table has no partitions. Expand any plain inheritance
+ * children in the order the OIDs were returned by
+ * find_all_inheritors.
+ */
+ foreach(l, inhOIDs)
{
- Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);
+ Oid childOID = lfirst_oid(l);
+ Relation newrelation;
+
+ /* Open rel if needed; we already have required locks */
+ if (childOID != parentOID)
+ newrelation = heap_open(childOID, NoLock);
+ else
+ newrelation = oldrelation;
/*
- * If this table has partitions, recursively expand them in the order
- * in which they appear in the PartitionDesc. While at it, also
- * extract the partition key columns of all the partitioned tables.
- */
- expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
- lockmode, &root->append_rel_list);
+ * It is possible that the parent table has children that are temp
+ * tables of other backends. We cannot safely access such tables
+ * (because of buffering issues), and the best thing to do seems
+ * to be to silently ignore them.
+ */
+ if (childOID != parentOID && RELATION_IS_OTHER_TEMP(newrelation))
+ {
+ heap_close(newrelation, lockmode);
+ continue;
+ }
+
+ expand_single_inheritance_child(root, rte, rti, oldrelation, oldrc,
+ newrelation,
+ &appinfos, &childrte,
+ &childRTindex);
+
+ /* Close child relations, but keep locks */
+ if (childOID != parentOID)
+ heap_close(newrelation, NoLock);
}
+
+ /*
+ * If all the children were temp tables, pretend it's a
+ * non-inheritance situation; we don't need Append node in that case.
+ * The duplicate RTE we added for the parent table is harmless, so we
+ * don't bother to get rid of it; ditto for the useless PlanRowMark
+ * node.
+ */
+ if (list_length(appinfos) < 2)
+ rte->inh = false;
else
- {
- List *appinfos = NIL;
- RangeTblEntry *childrte;
- Index childRTindex;
+ root->append_rel_list = list_concat(root->append_rel_list,
+ appinfos);
- /*
- * This table has no partitions. Expand any plain inheritance
- * children in the order the OIDs were returned by
- * find_all_inheritors.
- */
- foreach(l, inhOIDs)
- {
- Oid childOID = lfirst_oid(l);
- Relation newrelation;
+ heap_close(oldrelation, NoLock);
+}
- /* Open rel if needed; we already have required locks */
- if (childOID != parentOID)
- newrelation = heap_open(childOID, NoLock);
- else
- newrelation = oldrelation;
+static void
+expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
+{
+ Query *parse = root->parse;
+ PlanRowMark *partrc;
+ Relation partrel;
+ LOCKMODE lockmode;
+ PartitionDesc partdesc;
- /*
- * It is possible that the parent table has children that are temp
- * tables of other backends. We cannot safely access such tables
- * (because of buffering issues), and the best thing to do seems
- * to be to silently ignore them.
- */
- if (childOID != parentOID && RELATION_IS_OTHER_TEMP(newrelation))
- {
- heap_close(newrelation, lockmode);
- continue;
- }
+ Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);
+
+ /* Support legacy ONLY syntax for partitions */
+ if (!rte->inh)
+ return;
- expand_single_inheritance_child(root, rte, rti, oldrelation, oldrc,
- newrelation,
- &appinfos, &childrte,
- &childRTindex);
+ /*
+ * The rewriter should already have obtained an appropriate lock on each
+ * relation named in the query. However, for each child relation we add
+ * to the query, we must obtain an appropriate lock, because this will be
+ * the first use of those relations in the parse/rewrite/plan pipeline.
+ *
+ * If the parent relation is the query's result relation, then we need
+ * RowExclusiveLock. Otherwise, if it's accessed FOR UPDATE/SHARE, we
+ * need RowShareLock; otherwise AccessShareLock. We can't just grab
+ * AccessShareLock because then the executor would be trying to upgrade
+ * the lock, leading to possible deadlocks. (This code should match the
+ * parser and rewriter.)
+ */
+ partrc = get_plan_rowmark(root->rowMarks, rti);
+ if (rti == parse->resultRelation)
+ lockmode = RowExclusiveLock;
+ else if (partrc && RowMarkRequiresRowShareLock(partrc->markType))
+ lockmode = RowShareLock;
+ else
+ lockmode = AccessShareLock;
- /* Close child relations, but keep locks */
- if (childOID != parentOID)
- heap_close(newrelation, NoLock);
- }
+ /*
+ * If parent relation is selected FOR UPDATE/SHARE, we need to mark its
+ * PlanRowMark as isParent = true, and generate a new PlanRowMark for each
+ * child.
+ */
+ if (partrc)
+ partrc->isParent = true;
- /*
- * If all the children were temp tables, pretend it's a
- * non-inheritance situation; we don't need Append node in that case.
- * The duplicate RTE we added for the parent table is harmless, so we
- * don't bother to get rid of it; ditto for the useless PlanRowMark
- * node.
- */
- if (list_length(appinfos) < 2)
- rte->inh = false;
- else
- root->append_rel_list = list_concat(root->append_rel_list,
- appinfos);
+ /*
+ * Must open the partitioned relation to examine its tupdesc. We need not
+ * lock it; we assume the rewriter already did.
+ */
+ partrel = heap_open(rte->relid, NoLock);
- }
+ partdesc = RelationGetPartitionDesc(partrel);
- heap_close(oldrelation, NoLock);
+ /*
+ * If this table has partitions, recursively expand them in the order in
+ * which they appear in the PartitionDesc. While at it, also extract the
+ * partition key columns of all the partitioned tables.
+ */
+ expand_partitioned_rtentry_recurse(root, rte, rti, partrel, partrc,
+ lockmode, &root->append_rel_list);
+
+ heap_close(partrel, NoLock);
}
/*
- * expand_partitioned_rtentry
+ * expand_partitioned_rtentry_recurse
* Recursively expand an RTE for a partitioned table.
*
* Note that RelationGetPartitionDispatchInfo will expand partitions in the
* same order as this code.
*/
static void
-expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
- Index parentRTindex, Relation parentrel,
- PlanRowMark *top_parentrc, LOCKMODE lockmode,
- List **appinfos)
+expand_partitioned_rtentry_recurse(PlannerInfo *root, RangeTblEntry *parentrte,
+ Index parentRTindex, Relation parentrel,
+ PlanRowMark *top_parentrc, LOCKMODE lockmode,
+ List **appinfos)
{
int i;
RangeTblEntry *childrte;
@@ -1689,8 +1753,6 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
/* A partitioned table should always have a partition descriptor. */
Assert(partdesc);
- Assert(parentrte->inh);
-
/*
* Note down whether any partition key cols are being updated. Though it's
* the root partitioned table's updatedCols we are interested in, we
@@ -1722,16 +1784,7 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
Oid childOID = partdesc->oids[i];
Relation childrel;
- /* Open rel; we already have required locks */
- childrel = heap_open(childOID, NoLock);
-
- /*
- * Temporary partitions belonging to other sessions should have been
- * disallowed at definition, but for paranoia's sake, let's double
- * check.
- */
- if (RELATION_IS_OTHER_TEMP(childrel))
- elog(ERROR, "temporary relation from another session found as partition");
+ childrel = heap_open(childOID, lockmode);
expand_single_inheritance_child(root, parentrte, parentRTindex,
parentrel, top_parentrc, childrel,
@@ -1739,7 +1792,7 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
/* If this child is itself partitioned, recurse */
if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
- expand_partitioned_rtentry(root, childrte, childRTindex,
+ expand_partitioned_rtentry_recurse(root, childrte, childRTindex,
childrel, top_parentrc, lockmode,
appinfos);
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index 9015a05d32..360d4e08cd 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -15,6 +15,7 @@
#include "catalog/partition.h"
#include "catalog/pg_inherits.h"
+#include "catalog/pg_partition.h"
#include "catalog/pg_type.h"
#include "commands/tablecmds.h"
#include "executor/executor.h"
@@ -628,8 +629,8 @@ check_default_partition_contents(Relation parent, Relation default_rel,
* that do not satisfy the revised partition constraints.
*/
if (default_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
- all_parts = find_all_inheritors(RelationGetRelid(default_rel),
- AccessExclusiveLock, NULL);
+ all_parts = get_partition_descendants(RelationGetRelid(default_rel),
+ AccessExclusiveLock);
else
all_parts = list_make1_oid(RelationGetRelid(default_rel));
@@ -1634,12 +1635,12 @@ get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
bool isnull;
PartitionBoundSpec *bspec;
- tuple = SearchSysCache1(RELOID, inhrelid);
+ tuple = SearchSysCache1(PARTSRELID, inhrelid);
if (!HeapTupleIsValid(tuple))
elog(ERROR, "cache lookup failed for relation %u", inhrelid);
- datum = SysCacheGetAttr(RELOID, tuple,
- Anum_pg_class_relpartbound,
+ datum = SysCacheGetAttr(PARTSRELID, tuple,
+ Anum_pg_partition_partbound,
&isnull);
Assert(!isnull);
diff --git a/src/backend/rewrite/rewriteDefine.c b/src/backend/rewrite/rewriteDefine.c
index d81a2ea342..3c5e8e75d6 100644
--- a/src/backend/rewrite/rewriteDefine.c
+++ b/src/backend/rewrite/rewriteDefine.c
@@ -428,7 +428,7 @@ DefineQueryRewrite(const char *rulename,
errmsg("cannot convert partitioned table \"%s\" to a view",
RelationGetRelationName(event_relation))));
- if (event_relation->rd_rel->relispartition)
+ if (OidIsValid(event_relation->rd_rel->relpartitionparent))
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
errmsg("cannot convert partition \"%s\" to a view",
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index b5804f64ad..51e0e789f3 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -25,6 +25,7 @@
#include "catalog/namespace.h"
#include "catalog/pg_inherits.h"
#include "catalog/toasting.h"
+#include "catalog/partition.h"
#include "commands/alter.h"
#include "commands/async.h"
#include "commands/cluster.h"
@@ -1323,10 +1324,10 @@ ProcessUtilitySlow(ParseState *pstate,
get_rel_relkind(relid) == RELKIND_PARTITIONED_TABLE)
{
ListCell *lc;
- List *inheritors = NIL;
+ List *partoids = NIL;
- inheritors = find_all_inheritors(relid, lockmode, NULL);
- foreach(lc, inheritors)
+ partoids = get_partition_descendants(relid, lockmode);
+ foreach(lc, partoids)
{
char relkind = get_rel_relkind(lfirst_oid(lc));
@@ -1340,7 +1341,7 @@ ProcessUtilitySlow(ParseState *pstate,
errdetail("Table \"%s\" contains partitions that are foreign tables.",
stmt->relation->relname)));
}
- list_free(inheritors);
+ list_free(partoids);
}
/* Run parse analysis ... */
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 82acfeb460..51a21c4793 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -18,9 +18,10 @@
#include "access/heapam.h"
#include "access/htup_details.h"
#include "access/nbtree.h"
+#include "catalog/indexing.h"
#include "catalog/partition.h"
-#include "catalog/pg_inherits.h"
#include "catalog/pg_opclass.h"
+#include "catalog/pg_partition.h"
#include "catalog/pg_partitioned_table.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -30,6 +31,7 @@
#include "partitioning/partbounds.h"
#include "utils/builtins.h"
#include "utils/datum.h"
+#include "utils/fmgroids.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/partcache.h"
@@ -260,8 +262,7 @@ RelationBuildPartitionKey(Relation relation)
void
RelationBuildPartitionDesc(Relation rel)
{
- List *inhoids,
- *partoids;
+ List *partoids;
Oid *oids = NULL;
List *boundspecs = NIL;
ListCell *cell;
@@ -270,6 +271,10 @@ RelationBuildPartitionDesc(Relation rel)
PartitionKey key = RelationGetPartitionKey(rel);
PartitionDesc result;
MemoryContext oldcxt;
+ SysScanDesc scan;
+ ScanKeyData scankey[1];
+ Relation pgpart;
+ HeapTuple partTuple;
int ndatums = 0;
int default_index = -1;
@@ -285,39 +290,47 @@ RelationBuildPartitionDesc(Relation rel)
PartitionRangeBound **rbounds = NULL;
/* Get partition oids from pg_inherits */
- inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
+ pgpart = heap_open(PartitionRelationId, AccessShareLock);
+
+ ScanKeyInit(&scankey[0],
+ Anum_pg_partition_parentrelid,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(RelationGetRelid(rel)));
+
+ scan = systable_beginscan(pgpart, PartitionParentrelidIndexId, true,
+ NULL, 1, scankey);
/* Collect bound spec nodes in a list */
- i = 0;
partoids = NIL;
- foreach(cell, inhoids)
+ while ((partTuple = systable_getnext(scan)) != NULL)
{
- Oid inhrelid = lfirst_oid(cell);
+ Oid partrelid = ((Form_pg_partition) GETSTRUCT(partTuple))->partrelid;
HeapTuple tuple;
Datum datum;
bool isnull;
Node *boundspec;
- tuple = SearchSysCache1(RELOID, inhrelid);
+ tuple = SearchSysCache1(RELOID, partrelid);
if (!HeapTupleIsValid(tuple))
- elog(ERROR, "cache lookup failed for relation %u", inhrelid);
+ elog(ERROR, "cache lookup failed for relation %u", partrelid);
/*
* It is possible that the pg_class tuple of a partition has not been
* updated yet to set its relpartbound field. The only case where
* this happens is when we open the parent relation to check using its
* partition descriptor that a new partition's bound does not overlap
- * some existing partition.
+ * some existing partition. XXX this comment needs updating! Figure
+ * out if this check is still required.
*/
- if (!((Form_pg_class) GETSTRUCT(tuple))->relispartition)
+ if (!OidIsValid(((Form_pg_class) GETSTRUCT(tuple))->relpartitionparent))
{
ReleaseSysCache(tuple);
continue;
}
- datum = SysCacheGetAttr(RELOID, tuple,
- Anum_pg_class_relpartbound,
- &isnull);
+ datum = fastgetattr(partTuple, Anum_pg_partition_partbound,
+ pgpart->rd_att, &isnull);
+
Assert(!isnull);
boundspec = (Node *) stringToNode(TextDatumGetCString(datum));
@@ -331,16 +344,20 @@ RelationBuildPartitionDesc(Relation rel)
Oid partdefid;
partdefid = get_default_partition_oid(RelationGetRelid(rel));
- if (partdefid != inhrelid)
+ if (partdefid != partrelid)
elog(ERROR, "expected partdefid %u, but got %u",
- inhrelid, partdefid);
+ partrelid, partdefid);
}
boundspecs = lappend(boundspecs, boundspec);
- partoids = lappend_oid(partoids, inhrelid);
+ partoids = lappend_oid(partoids, partrelid);
ReleaseSysCache(tuple);
}
+ systable_endscan(scan);
+
+ heap_close(pgpart, AccessShareLock);
+
nparts = list_length(partoids);
if (nparts > 0)
@@ -808,7 +825,7 @@ List *
RelationGetPartitionQual(Relation rel)
{
/* Quick exit */
- if (!rel->rd_rel->relispartition)
+ if (!OidIsValid(rel->rd_rel->relpartitionparent))
return NIL;
return generate_partition_qual(rel);
@@ -829,7 +846,7 @@ get_partition_qual_relid(Oid relid)
List *and_args;
/* Do the work only if this relation is a partition. */
- if (rel->rd_rel->relispartition)
+ if (OidIsValid(rel->rd_rel->relpartitionparent))
{
and_args = generate_partition_qual(rel);
@@ -880,20 +897,19 @@ generate_partition_qual(Relation rel)
return copyObject(rel->rd_partcheck);
/* Grab at least an AccessShareLock on the parent table */
- parent = heap_open(get_partition_parent(RelationGetRelid(rel)),
- AccessShareLock);
+ parent = heap_open(RelationGetParentRelid(rel), AccessShareLock);
- /* Get pg_class.relpartbound */
- tuple = SearchSysCache1(RELOID, RelationGetRelid(rel));
+ /* Get pg_partition.partbound */
+ tuple = SearchSysCache1(PARTSRELID, RelationGetRelid(rel));
if (!HeapTupleIsValid(tuple))
elog(ERROR, "cache lookup failed for relation %u",
RelationGetRelid(rel));
- boundDatum = SysCacheGetAttr(RELOID, tuple,
- Anum_pg_class_relpartbound,
+ boundDatum = SysCacheGetAttr(PARTSRELID, tuple,
+ Anum_pg_partition_partbound,
&isnull);
if (isnull) /* should not happen */
- elog(ERROR, "relation \"%s\" has relpartbound = null",
+ elog(ERROR, "relation \"%s\" has partbound = null",
RelationGetRelationName(rel));
bound = castNode(PartitionBoundSpec,
stringToNode(TextDatumGetCString(boundDatum)));
@@ -902,7 +918,7 @@ generate_partition_qual(Relation rel)
my_qual = get_qual_from_partbound(rel, parent, bound);
/* Add the parent's quals to the list (if any) */
- if (parent->rd_rel->relispartition)
+ if (OidIsValid(parent->rd_rel->relpartitionparent))
result = list_concat(generate_partition_qual(parent), my_qual);
else
result = my_qual;
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index 2b381782a3..8ad9d2fa3b 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -48,6 +48,7 @@
#include "catalog/pg_opclass.h"
#include "catalog/pg_operator.h"
#include "catalog/pg_opfamily.h"
+#include "catalog/pg_partition.h"
#include "catalog/pg_partitioned_table.h"
#include "catalog/pg_proc.h"
#include "catalog/pg_publication.h"
@@ -584,6 +585,17 @@ static const struct cachedesc cacheinfo[] = {
},
32
},
+ {PartitionRelationId, /* PARTSRELID */
+ PartitionRelidIndexId,
+ 1,
+ {
+ Anum_pg_partition_partrelid,
+ 0,
+ 0,
+ 0
+ },
+ 32
+ },
{ProcedureRelationId, /* PROCNAMEARGSNSP */
ProcedureNameArgsNspIndexId,
3,
@@ -1238,7 +1250,7 @@ GetSysCacheOid(int cacheId,
/*
* SearchSysCacheAttName
- *
 *
* This routine is equivalent to SearchSysCache on the ATTNAME cache,
* except that it will return NULL if the found attribute is marked
* attisdropped. This is convenient for callers that want to act as
diff --git a/src/bin/pg_dump/common.c b/src/bin/pg_dump/common.c
index 0d147cb08d..4f6c68e844 100644
--- a/src/bin/pg_dump/common.c
+++ b/src/bin/pg_dump/common.c
@@ -1022,6 +1022,28 @@ findParentsByOid(TableInfo *self,
j;
int numParents;
+ /*
+ * For PG12 and above pg_class has a relpartitionparent column that allows
+ * us to determine the parent without looking at the inheritance data.
+ */
+ if (self->partitionparent != 0)
+ {
+ self->parents = (TableInfo **) pg_malloc(sizeof(TableInfo *));
+
+ self->parents[0] = findTableByOid(self->partitionparent);
+
+ if (self->parents[0] == NULL)
+ {
+ write_msg(NULL, "failed sanity check, parent OID %u of table \"%s\" (OID %u) not found\n",
+ self->partitionparent,
+ self->dobj.name,
+ oid);
+ exit_nicely(1);
+ }
+ self->numParents = 1;
+ return;
+ }
+
numParents = 0;
for (i = 0; i < numInherits; i++)
{
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 20e8aedb19..af7e2bd813 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -5878,6 +5878,7 @@ getTables(Archive *fout, int *numTables)
int i_changed_acl;
int i_partkeydef;
int i_ispartition;
+ int i_partitionparent;
int i_partbound;
/*
@@ -5904,7 +5905,9 @@ getTables(Archive *fout, int *numTables)
{
char *partkeydef = "NULL";
char *ispartition = "false";
+ char *partitionparent = "0";
char *partbound = "NULL";
+ char *partjoin = "";
PQExpBuffer acl_subquery = createPQExpBuffer();
PQExpBuffer racl_subquery = createPQExpBuffer();
@@ -5920,14 +5923,23 @@ getTables(Archive *fout, int *numTables)
* Collect the information about any partitioned tables, which were
* added in PG10.
*/
-
- if (fout->remoteVersion >= 100000)
+ /* XXX how much do we add here before we split this out completely? */
+ if (fout->remoteVersion >= 120000)
+ {
+ partjoin = "LEFT JOIN pg_partition p ON (c.oid = p.partrelid) ";
+ partkeydef = "pg_get_partkeydef(c.oid)";
+ ispartition = "c.relpartitionparent <> 0";
+ partitionparent = "c.relpartitionparent";
+ partbound = "pg_get_expr(p.partbound, c.oid)";
+ }
+ else if (fout->remoteVersion >= 100000)
{
partkeydef = "pg_get_partkeydef(c.oid)";
ispartition = "c.relispartition";
partbound = "pg_get_expr(c.relpartbound, c.oid)";
}
+
/*
* Left join to pick up dependency info linking sequences to their
* owning column, if any (note this dependency is AUTO as of 8.2)
@@ -5982,6 +5994,7 @@ getTables(Archive *fout, int *numTables)
"AS changed_acl, "
"%s AS partkeydef, "
"%s AS ispartition, "
+ "%s AS partitionparent, "
"%s AS partbound "
"FROM pg_class c "
"LEFT JOIN pg_depend d ON "
@@ -5994,6 +6007,7 @@ getTables(Archive *fout, int *numTables)
"(c.oid = pip.objoid "
"AND pip.classoid = 'pg_class'::regclass "
"AND pip.objsubid = 0) "
+ "%s"
"WHERE c.relkind in ('%c', '%c', '%c', '%c', '%c', '%c', '%c') "
"ORDER BY c.oid",
acl_subquery->data,
@@ -6008,8 +6022,10 @@ getTables(Archive *fout, int *numTables)
attinitracl_subquery->data,
partkeydef,
ispartition,
+ partitionparent,
partbound,
RELKIND_SEQUENCE,
+ partjoin,
RELKIND_RELATION, RELKIND_SEQUENCE,
RELKIND_VIEW, RELKIND_COMPOSITE_TYPE,
RELKIND_MATVIEW, RELKIND_FOREIGN_TABLE,
@@ -6057,6 +6073,7 @@ getTables(Archive *fout, int *numTables)
"NULL AS changed_acl, "
"NULL AS partkeydef, "
"false AS ispartition, "
+ "0 AS partitionparent, "
"NULL AS partbound "
"FROM pg_class c "
"LEFT JOIN pg_depend d ON "
@@ -6106,6 +6123,7 @@ getTables(Archive *fout, int *numTables)
"NULL AS changed_acl, "
"NULL AS partkeydef, "
"false AS ispartition, "
+ "0 AS partitionparent, "
"NULL AS partbound "
"FROM pg_class c "
"LEFT JOIN pg_depend d ON "
@@ -6155,6 +6173,7 @@ getTables(Archive *fout, int *numTables)
"NULL AS changed_acl, "
"NULL AS partkeydef, "
"false AS ispartition, "
+ "0 AS partitionparent, "
"NULL AS partbound "
"FROM pg_class c "
"LEFT JOIN pg_depend d ON "
@@ -6202,6 +6221,7 @@ getTables(Archive *fout, int *numTables)
"NULL AS changed_acl, "
"NULL AS partkeydef, "
"false AS ispartition, "
+ "0 AS partitionparent, "
"NULL AS partbound "
"FROM pg_class c "
"LEFT JOIN pg_depend d ON "
@@ -6249,6 +6269,7 @@ getTables(Archive *fout, int *numTables)
"NULL AS changed_acl, "
"NULL AS partkeydef, "
"false AS ispartition, "
+ "0 AS partitionparent, "
"NULL AS partbound "
"FROM pg_class c "
"LEFT JOIN pg_depend d ON "
@@ -6295,6 +6316,7 @@ getTables(Archive *fout, int *numTables)
"NULL AS changed_acl, "
"NULL AS partkeydef, "
"false AS ispartition, "
+ "0 AS partitionparent, "
"NULL AS partbound "
"FROM pg_class c "
"LEFT JOIN pg_depend d ON "
@@ -6341,6 +6363,7 @@ getTables(Archive *fout, int *numTables)
"NULL AS changed_acl, "
"NULL AS partkeydef, "
"false AS ispartition, "
+ "0 AS partitionparent, "
"NULL AS partbound "
"FROM pg_class c "
"LEFT JOIN pg_depend d ON "
@@ -6386,6 +6409,7 @@ getTables(Archive *fout, int *numTables)
"NULL AS changed_acl, "
"NULL AS partkeydef, "
"false AS ispartition, "
+ "0 AS partitionparent, "
"NULL AS partbound "
"FROM pg_class c "
"LEFT JOIN pg_depend d ON "
@@ -6455,6 +6479,7 @@ getTables(Archive *fout, int *numTables)
i_changed_acl = PQfnumber(res, "changed_acl");
i_partkeydef = PQfnumber(res, "partkeydef");
i_ispartition = PQfnumber(res, "ispartition");
+ i_partitionparent = PQfnumber(res, "partitionparent");
i_partbound = PQfnumber(res, "partbound");
if (dopt->lockWaitTimeout)
@@ -6561,6 +6586,7 @@ getTables(Archive *fout, int *numTables)
/* Partition key string or NULL */
tblinfo[i].partkeydef = pg_strdup(PQgetvalue(res, i, i_partkeydef));
tblinfo[i].ispartition = (strcmp(PQgetvalue(res, i, i_ispartition), "t") == 0);
+ tblinfo[i].partitionparent = atooid(PQgetvalue(res, i, i_partitionparent));
tblinfo[i].partbound = pg_strdup(PQgetvalue(res, i, i_partbound));
/*
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 1448005f30..e5155f760e 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -296,6 +296,7 @@ typedef struct _tableInfo
bool dummy_view; /* view's real definition must be postponed */
bool postponed_def; /* matview must be postponed into post-data */
bool ispartition; /* is table a partition? */
+ Oid partitionparent; /* owning partitioned table, or 0 */
/*
* These fields are computed only if we decide the table is interesting
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 80d8338b96..7fde1114a0 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -2044,18 +2044,36 @@ describeOneTableDetails(const char *schemaname,
char *partdef;
char *partconstraintdef = NULL;
- printfPQExpBuffer(&buf,
- "SELECT inhparent::pg_catalog.regclass,\n"
- " pg_catalog.pg_get_expr(c.relpartbound, inhrelid)");
- /* If verbose, also request the partition constraint definition */
- if (verbose)
+ if (pset.sversion >= 120000)
+ {
+ printfPQExpBuffer(&buf,
+ "SELECT p.parentrelid::pg_catalog.regclass,\n"
+ " pg_catalog.pg_get_expr(p.partbound, p.partrelid)");
+ /* If verbose, also request the partition constraint definition */
+ if (verbose)
+ appendPQExpBuffer(&buf,
+ ",\n pg_catalog.pg_get_partition_constraintdef(p.partrelid)");
appendPQExpBuffer(&buf,
- ",\n pg_catalog.pg_get_partition_constraintdef(inhrelid)");
- appendPQExpBuffer(&buf,
- "\nFROM pg_catalog.pg_class c"
- " JOIN pg_catalog.pg_inherits i"
- " ON c.oid = inhrelid"
- "\nWHERE c.oid = '%s' AND c.relispartition;", oid);
+ "\nFROM pg_catalog.pg_class c"
+ " JOIN pg_catalog.pg_partition p"
+ " ON c.oid = p.partrelid"
+ "\nWHERE c.oid = '%s' AND c.relpartitionparent <> 0;", oid);
+ }
+ else
+ {
+ printfPQExpBuffer(&buf,
+ "SELECT inhparent::pg_catalog.regclass,\n"
+ " pg_catalog.pg_get_expr(c.relpartbound, inhrelid)");
+ /* If verbose, also request the partition constraint definition */
+ if (verbose)
+ appendPQExpBuffer(&buf,
+ ",\n pg_catalog.pg_get_partition_constraintdef(inhrelid)");
+ appendPQExpBuffer(&buf,
+ "\nFROM pg_catalog.pg_class c"
+ " JOIN pg_catalog.pg_inherits i"
+ " ON c.oid = inhrelid"
+ "\nWHERE c.oid = '%s' AND c.relispartition;", oid);
+ }
result = PSQLexec(buf.data);
if (!result)
goto error_return;
@@ -2984,7 +3002,25 @@ describeOneTableDetails(const char *schemaname,
}
/* print child tables (with additional info if partitions) */
- if (pset.sversion >= 100000)
+ if (pset.sversion >= 120000)
+ {
+ if (tableinfo.relkind == RELKIND_PARTITIONED_TABLE)
+ printfPQExpBuffer(&buf,
+ "SELECT c.oid::pg_catalog.regclass,"
+ " pg_catalog.pg_get_expr(p.partbound, p.partrelid),"
+ " c.relkind"
+ " FROM pg_catalog.pg_class c, pg_catalog.pg_partition p"
+ " WHERE c.oid=p.partrelid AND p.parentrelid = '%s'"
+ " ORDER BY pg_catalog.pg_get_expr(p.partbound, p.partrelid) = 'DEFAULT',"
+ " c.oid::pg_catalog.regclass::pg_catalog.text;", oid);
+ else
+ printfPQExpBuffer(&buf,
+ "SELECT c.oid::pg_catalog.regclass,NULL,c.relkind"
+ " FROM pg_catalog.pg_class c, pg_catalog.pg_inherits i"
+ " WHERE c.oid=i.inhrelid AND i.inhparent = '%s'"
+ " ORDER BY c.oid::pg_catalog.regclass::pg_catalog.text;", oid);
+ }
+ else if (pset.sversion >= 100000)
printfPQExpBuffer(&buf,
"SELECT c.oid::pg_catalog.regclass,"
" pg_catalog.pg_get_expr(c.relpartbound, c.oid),"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index bb696f8ee9..b8dbc5b539 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1134,6 +1134,7 @@ static const SchemaQuery Query_for_list_of_statistics = {
" SELECT 'DEFAULT' ) ss "\
" WHERE pg_catalog.substring(name,1,%%d)='%%s'"
+/* XXX fix this */
/* the silly-looking length condition is just to eat up the current word */
#define Query_for_partition_of_table \
"SELECT pg_catalog.quote_ident(c2.relname) "\
diff --git a/src/include/catalog/heap.h b/src/include/catalog/heap.h
index c5e40ff017..d66a2f0b0d 100644
--- a/src/include/catalog/heap.h
+++ b/src/include/catalog/heap.h
@@ -149,7 +149,7 @@ extern void StorePartitionKey(Relation rel,
Oid *partopclass,
Oid *partcollation);
extern void RemovePartitionKeyByRelId(Oid relid);
-extern void StorePartitionBound(Relation rel, Relation parent,
- PartitionBoundSpec *bound);
+extern void MarkRelationPartitioned(Relation rel, Relation parent,
+ bool is_default);
#endif /* HEAP_H */
diff --git a/src/include/catalog/indexing.h b/src/include/catalog/indexing.h
index 24915824ca..a42c7041d5 100644
--- a/src/include/catalog/indexing.h
+++ b/src/include/catalog/indexing.h
@@ -339,6 +339,12 @@ DECLARE_UNIQUE_INDEX(pg_replication_origin_roname_index, 6002, on pg_replication
DECLARE_UNIQUE_INDEX(pg_partitioned_table_partrelid_index, 3351, on pg_partitioned_table using btree(partrelid oid_ops));
#define PartitionedRelidIndexId 3351
+DECLARE_UNIQUE_INDEX(pg_partition_partrelid_index, 3998, on pg_partition using btree(partrelid oid_ops));
+#define PartitionRelidIndexId 3998
+
+DECLARE_INDEX(pg_partition_parentrelid_index, 4001, on pg_partition using btree(parentrelid oid_ops));
+#define PartitionParentrelidIndexId 4001
+
DECLARE_UNIQUE_INDEX(pg_publication_oid_index, 6110, on pg_publication using btree(oid oid_ops));
#define PublicationObjectIndexId 6110
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 4b3b5ae770..8cbbd227f4 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -15,6 +15,7 @@
#include "fmgr.h"
#include "partitioning/partdefs.h"
+#include "storage/lock.h"
#include "utils/relcache.h"
/* Seed for the extended hash function */
@@ -35,6 +36,7 @@ typedef struct PartitionDescData
extern Oid get_partition_parent(Oid relid);
extern List *get_partition_ancestors(Oid relid);
+extern List *get_partition_descendants(Oid relid, LOCKMODE lockmode);
extern List *map_partition_varattnos(List *expr, int fromrel_varno,
Relation to_rel, Relation from_rel,
bool *found_whole_row);
diff --git a/src/include/catalog/pg_class.dat b/src/include/catalog/pg_class.dat
index 9fffdef379..d0c5ab466e 100644
--- a/src/include/catalog/pg_class.dat
+++ b/src/include/catalog/pg_class.dat
@@ -28,9 +28,9 @@
relpersistence => 'p', relkind => 'r', relnatts => '30', relchecks => '0',
relhasoids => 't', relhasrules => 'f', relhastriggers => 'f',
relhassubclass => 'f', relrowsecurity => 'f', relforcerowsecurity => 'f',
- relispopulated => 't', relreplident => 'n', relispartition => 'f',
+ relispopulated => 't', relreplident => 'n', relpartitionparent => '0',
relrewrite => '0', relfrozenxid => '3', relminmxid => '1', relacl => '_null_',
- reloptions => '_null_', relpartbound => '_null_' },
+ reloptions => '_null_' },
{ oid => '1249',
relname => 'pg_attribute', relnamespace => 'PGNSP', reltype => '75',
reloftype => '0', relowner => 'PGUID', relam => '0', relfilenode => '0',
@@ -39,9 +39,9 @@
relpersistence => 'p', relkind => 'r', relnatts => '24', relchecks => '0',
relhasoids => 'f', relhasrules => 'f', relhastriggers => 'f',
relhassubclass => 'f', relrowsecurity => 'f', relforcerowsecurity => 'f',
- relispopulated => 't', relreplident => 'n', relispartition => 'f',
+ relispopulated => 't', relreplident => 'n', relpartitionparent => '0',
relrewrite => '0', relfrozenxid => '3', relminmxid => '1', relacl => '_null_',
- reloptions => '_null_', relpartbound => '_null_' },
+ reloptions => '_null_' },
{ oid => '1255',
relname => 'pg_proc', relnamespace => 'PGNSP', reltype => '81',
reloftype => '0', relowner => 'PGUID', relam => '0', relfilenode => '0',
@@ -50,19 +50,19 @@
relpersistence => 'p', relkind => 'r', relnatts => '28', relchecks => '0',
relhasoids => 't', relhasrules => 'f', relhastriggers => 'f',
relhassubclass => 'f', relrowsecurity => 'f', relforcerowsecurity => 'f',
- relispopulated => 't', relreplident => 'n', relispartition => 'f',
+ relispopulated => 't', relreplident => 'n', relpartitionparent => '0',
relrewrite => '0', relfrozenxid => '3', relminmxid => '1', relacl => '_null_',
- reloptions => '_null_', relpartbound => '_null_' },
+ reloptions => '_null_' },
{ oid => '1259',
relname => 'pg_class', relnamespace => 'PGNSP', reltype => '83',
reloftype => '0', relowner => 'PGUID', relam => '0', relfilenode => '0',
reltablespace => '0', relpages => '0', reltuples => '0', relallvisible => '0',
reltoastrelid => '0', relhasindex => 'f', relisshared => 'f',
- relpersistence => 'p', relkind => 'r', relnatts => '33', relchecks => '0',
+ relpersistence => 'p', relkind => 'r', relnatts => '32', relchecks => '0',
relhasoids => 't', relhasrules => 'f', relhastriggers => 'f',
relhassubclass => 'f', relrowsecurity => 'f', relforcerowsecurity => 'f',
- relispopulated => 't', relreplident => 'n', relispartition => 'f',
+ relispopulated => 't', relreplident => 'n', relpartitionparent => '0',
relrewrite => '0', relfrozenxid => '3', relminmxid => '1', relacl => '_null_',
- reloptions => '_null_', relpartbound => '_null_' },
+ reloptions => '_null_' },
]
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index dc6c415c58..a8f2cadcf6 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -66,7 +66,7 @@ CATALOG(pg_class,1259,RelationRelationId) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,Relat
* not */
bool relispopulated; /* matview currently holds query results */
char relreplident; /* see REPLICA_IDENTITY_xxx constants */
- bool relispartition; /* is relation a partition? */
+ Oid relpartitionparent; /* Oid of parent if partition, or 0 */
Oid relrewrite; /* heap for rewrite during DDL, link to
* original rel */
TransactionId relfrozenxid; /* all Xids < this are frozen in this rel */
@@ -77,7 +77,6 @@ CATALOG(pg_class,1259,RelationRelationId) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,Relat
/* NOTE: These fields are not present in a relcache entry's rd_rel field. */
aclitem relacl[1]; /* access permissions */
text reloptions[1]; /* access-method-specific options */
- pg_node_tree relpartbound; /* partition bound node tree */
#endif
} FormData_pg_class;
diff --git a/src/include/catalog/toasting.h b/src/include/catalog/toasting.h
index f259890e43..7d3934865c 100644
--- a/src/include/catalog/toasting.h
+++ b/src/include/catalog/toasting.h
@@ -60,6 +60,7 @@ DECLARE_TOAST(pg_foreign_table, 4153, 4154);
DECLARE_TOAST(pg_init_privs, 4155, 4156);
DECLARE_TOAST(pg_language, 4157, 4158);
DECLARE_TOAST(pg_namespace, 4163, 4164);
+DECLARE_TOAST(pg_partition, 3424, 3425);
DECLARE_TOAST(pg_partitioned_table, 4165, 4166);
DECLARE_TOAST(pg_policy, 4167, 4168);
DECLARE_TOAST(pg_proc, 2836, 2837);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 07ab1a3dde..61b31fb2bb 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -803,7 +803,7 @@ typedef struct PartitionSpec
* PartitionBoundSpec - a partition bound specification
*
* This represents the portion of the partition key space assigned to a
- * particular partition. These are stored on disk in pg_class.relpartbound.
+ * particular partition. These are stored on disk in pg_partition.partbound.
*/
struct PartitionBoundSpec
{
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 6ecbdb6294..eb1858aa92 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -406,6 +406,12 @@ typedef struct ViewOptions
*/
#define RelationGetRelid(relation) ((relation)->rd_id)
+/*
+ * RelationGetParentRelid
+ * Returns the OID of the relation's parent, or InvalidOid if the
+ * relation has no parent.
+ */
+#define RelationGetParentRelid(relation) ((relation)->rd_rel->relpartitionparent)
/*
* RelationGetNumberOfAttributes
* Returns the total number of attributes in a relation.
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index 4f333586ee..44d94f001a 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -73,6 +73,7 @@ enum SysCacheIdentifier
OPFAMILYAMNAMENSP,
OPFAMILYOID,
PARTRELID,
+ PARTSRELID,
PROCNAMEARGSNSP,
PROCOID,
PUBLICATIONNAME,
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index 0218c2c362..bf62b042ac 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -3654,10 +3654,10 @@ ALTER TABLE list_parted2 ATTACH PARTITION part_2 FOR VALUES IN (2);
ERROR: "part_2" is already a partition
-- check that circular inheritance is not allowed
ALTER TABLE part_5 ATTACH PARTITION list_parted2 FOR VALUES IN ('b');
-ERROR: circular inheritance not allowed
+ERROR: circular partitioning is not allowed
DETAIL: "part_5" is already a child of "list_parted2".
ALTER TABLE list_parted2 ATTACH PARTITION list_parted2 FOR VALUES IN (0);
-ERROR: circular inheritance not allowed
+ERROR: circular partitioning is not allowed
DETAIL: "list_parted2" is already a child of "list_parted2".
-- If a partitioned table being created or an existing table being attached
-- as a partition does not have a constraint that would allow validation scan
diff --git a/src/test/regress/expected/misc_sanity.out b/src/test/regress/expected/misc_sanity.out
index 2d3522b500..6285da1ee1 100644
--- a/src/test/regress/expected/misc_sanity.out
+++ b/src/test/regress/expected/misc_sanity.out
@@ -100,10 +100,9 @@ ORDER BY 1, 2;
pg_attribute | attoptions | text[]
pg_class | relacl | aclitem[]
pg_class | reloptions | text[]
- pg_class | relpartbound | pg_node_tree
pg_index | indexprs | pg_node_tree
pg_index | indpred | pg_node_tree
pg_largeobject | data | bytea
pg_largeobject_metadata | lomacl | aclitem[]
-(11 rows)
+(10 rows)
diff --git a/src/test/regress/expected/sanity_check.out b/src/test/regress/expected/sanity_check.out
index 0aa5357917..44ad08e6e5 100644
--- a/src/test/regress/expected/sanity_check.out
+++ b/src/test/regress/expected/sanity_check.out
@@ -135,6 +135,7 @@ pg_namespace|t
pg_opclass|t
pg_operator|t
pg_opfamily|t
+pg_partition|t
pg_partitioned_table|t
pg_pltemplate|t
pg_policy|t
--
2.16.2.windows.1
Attachment: v1-0003-Don-t-store-partition-index-details-in-pg_inherit.patch
From 95d8fb7c2288765d044cfa95b0ccbebd955ecb19 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Thu, 2 Aug 2018 16:06:10 +1200
Subject: [PATCH v1 3/4] Don't store partition index details in pg_inherits
---
src/backend/catalog/index.c | 4 -
src/backend/commands/indexcmds.c | 163 ++++++++-------------------
src/backend/commands/tablecmds.c | 152 ++++++++++++-------------
src/bin/pg_dump/pg_dump.c | 36 +++++-
src/test/regress/expected/indexing.out | 195 ++++++++++++++++-----------------
src/test/regress/sql/indexing.sql | 69 ++++++------
6 files changed, 288 insertions(+), 331 deletions(-)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index eda850edef..26683d8960 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -978,10 +978,6 @@ index_create(Relation heapRelation,
!concurrent && !invalid,
!concurrent);
- /* update pg_inherits, if needed */
- if (OidIsValid(parentIndexRelid))
- StoreSingleInheritance(indexRelationId, parentIndexRelid, 1);
-
/*
* Register constraint and dependencies for the index.
*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index b9dad9672e..10aa1b1f84 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -924,11 +924,15 @@ DefineIndex(Oid relationId,
Relation cldidx;
IndexInfo *cldIdxInfo;
- /* this index is already partition of another one */
- if (has_superclass(cldidxid))
+ cldidx = index_open(cldidxid, lockmode);
+
+ /* Don't try to use any indexes which are already parented */
+ if (OidIsValid(RelationGetParentRelid(cldidx)))
+ {
+ index_close(cldidx, lockmode);
continue;
+ }
- cldidx = index_open(cldidxid, lockmode);
cldIdxInfo = BuildIndexInfo(cldidx);
if (CompareIndexInfo(cldIdxInfo, indexInfo,
cldidx->rd_indcollation,
@@ -2478,142 +2482,71 @@ ReindexPartitionedIndex(Relation parentIdx)
}
/*
- * Insert or delete an appropriate pg_inherits tuple to make the given index
- * be a partition of the indicated parent index.
+ * IndexSetParentIndex
+ * Update pg_class record to mark or unmark the parent of 'partitionIdx'.
*
* This also corrects the pg_depend information for the affected index.
*/
void
IndexSetParentIndex(Relation partitionIdx, Oid parentOid)
{
- Relation pg_inherits;
- ScanKeyData key[2];
- SysScanDesc scan;
Oid partRelid = RelationGetRelid(partitionIdx);
+ Relation classRel;
HeapTuple tuple;
- bool fix_dependencies;
+ ObjectAddress partIdx;
/* Make sure this is an index */
Assert(partitionIdx->rd_rel->relkind == RELKIND_INDEX ||
partitionIdx->rd_rel->relkind == RELKIND_PARTITIONED_INDEX);
+ /* Update pg_class tuple */
+ classRel = heap_open(RelationRelationId, RowExclusiveLock);
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(partRelid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", partRelid);
+
/*
- * Scan pg_inherits for rows linking our index to some parent.
+ * Sanity check that we're not trying to overwrite another parent, or
+ * trying to unset it when it's not set.
*/
- pg_inherits = relation_open(InheritsRelationId, RowExclusiveLock);
- ScanKeyInit(&key[0],
- Anum_pg_inherits_inhrelid,
- BTEqualStrategyNumber, F_OIDEQ,
- ObjectIdGetDatum(partRelid));
- ScanKeyInit(&key[1],
- Anum_pg_inherits_inhseqno,
- BTEqualStrategyNumber, F_INT4EQ,
- Int32GetDatum(1));
- scan = systable_beginscan(pg_inherits, InheritsRelidSeqnoIndexId, true,
- NULL, 2, key);
- tuple = systable_getnext(scan);
+ Assert(OidIsValid(((Form_pg_class) GETSTRUCT(tuple))->relpartitionparent)
+ != OidIsValid(parentOid));
- if (!HeapTupleIsValid(tuple))
- {
- if (parentOid == InvalidOid)
- {
- /*
- * No pg_inherits row, and no parent wanted: nothing to do in this
- * case.
- */
- fix_dependencies = false;
- }
- else
- {
- Datum values[Natts_pg_inherits];
- bool isnull[Natts_pg_inherits];
+ /* Set the relpartitionparent */
+ ((Form_pg_class) GETSTRUCT(tuple))->relpartitionparent = parentOid;
+ CatalogTupleUpdate(classRel, &tuple->t_self, tuple);
+ heap_freetuple(tuple);
+ heap_close(classRel, RowExclusiveLock);
- /*
- * No pg_inherits row exists, and we want a parent for this index,
- * so insert it.
- */
- values[Anum_pg_inherits_inhrelid - 1] = ObjectIdGetDatum(partRelid);
- values[Anum_pg_inherits_inhparent - 1] =
- ObjectIdGetDatum(parentOid);
- values[Anum_pg_inherits_inhseqno - 1] = Int32GetDatum(1);
- memset(isnull, false, sizeof(isnull));
-
- tuple = heap_form_tuple(RelationGetDescr(pg_inherits),
- values, isnull);
- CatalogTupleInsert(pg_inherits, tuple);
+ /*
+ * Insert/delete pg_depend rows. If setting a parent, add an
+ * INTERNAL_AUTO dependency to the parent index; if making standalone,
+ * remove all existing rows and put back the regular dependency on the
+ * table.
+ */
+ ObjectAddressSet(partIdx, RelationRelationId, partRelid);
- fix_dependencies = true;
- }
- }
- else
+ if (OidIsValid(parentOid))
{
- Form_pg_inherits inhForm = (Form_pg_inherits) GETSTRUCT(tuple);
+ ObjectAddress parentIdx;
- if (parentOid == InvalidOid)
- {
- /*
- * There exists a pg_inherits row, which we want to clear; do so.
- */
- CatalogTupleDelete(pg_inherits, &tuple->t_self);
- fix_dependencies = true;
- }
- else
- {
- /*
- * A pg_inherits row exists. If it's the same we want, then we're
- * good; if it differs, that amounts to a corrupt catalog and
- * should not happen.
- */
- if (inhForm->inhparent != parentOid)
- {
- /* unexpected: we should not get called in this case */
- elog(ERROR, "bogus pg_inherit row: inhrelid %u inhparent %u",
- inhForm->inhrelid, inhForm->inhparent);
- }
-
- /* already in the right state */
- fix_dependencies = false;
- }
+ ObjectAddressSet(parentIdx, RelationRelationId, parentOid);
+ recordDependencyOn(&partIdx, &parentIdx, DEPENDENCY_INTERNAL_AUTO);
}
-
- /* done with pg_inherits */
- systable_endscan(scan);
- relation_close(pg_inherits, RowExclusiveLock);
-
- if (fix_dependencies)
+ else
{
- ObjectAddress partIdx;
+ ObjectAddress partitionTbl;
- /*
- * Insert/delete pg_depend rows. If setting a parent, add an
- * INTERNAL_AUTO dependency to the parent index; if making standalone,
- * remove all existing rows and put back the regular dependency on the
- * table.
- */
- ObjectAddressSet(partIdx, RelationRelationId, partRelid);
+ ObjectAddressSet(partitionTbl, RelationRelationId,
+ partitionIdx->rd_index->indrelid);
- if (OidIsValid(parentOid))
- {
- ObjectAddress parentIdx;
+ deleteDependencyRecordsForClass(RelationRelationId, partRelid,
+ RelationRelationId,
+ DEPENDENCY_INTERNAL_AUTO);
- ObjectAddressSet(parentIdx, RelationRelationId, parentOid);
- recordDependencyOn(&partIdx, &parentIdx, DEPENDENCY_INTERNAL_AUTO);
- }
- else
- {
- ObjectAddress partitionTbl;
-
- ObjectAddressSet(partitionTbl, RelationRelationId,
- partitionIdx->rd_index->indrelid);
-
- deleteDependencyRecordsForClass(RelationRelationId, partRelid,
- RelationRelationId,
- DEPENDENCY_INTERNAL_AUTO);
-
- recordDependencyOn(&partIdx, &partitionTbl, DEPENDENCY_AUTO);
- }
-
- /* make our updates visible */
- CommandCounterIncrement();
+ recordDependencyOn(&partIdx, &partitionTbl, DEPENDENCY_AUTO);
}
+
+ /* make our updates visible */
+ CommandCounterIncrement();
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8a0fcd7ece..e01ca8211a 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14906,15 +14906,18 @@ ATExecDetachPartition(Relation rel, RangeVar *name)
Oid idxid = lfirst_oid(cell);
Relation idx;
- if (!has_superclass(idxid))
- continue;
-
- Assert((IndexGetRelation(get_partition_parent(idxid), false) ==
- RelationGetRelid(rel)));
idx = index_open(idxid, AccessExclusiveLock);
- IndexSetParentIndex(idx, InvalidOid);
- update_relpartitionparent(pgclass, idxid, InvalidOid);
+
+ if (OidIsValid(RelationGetParentRelid(idx)))
+ {
+ Assert((IndexGetRelation(RelationGetParentRelid(idx), false) ==
+ RelationGetRelid(rel)));
+
+ IndexSetParentIndex(idx, InvalidOid);
+ update_relpartitionparent(pgclass, idxid, InvalidOid);
+ }
+
relation_close(idx, AccessExclusiveLock);
}
heap_close(pgclass, RowExclusiveLock);
@@ -15149,6 +15152,7 @@ ATExecAttachPartitionIdx(List **wqueue, Relation parentIdx, RangeVar *name)
if (OidIsValid(constraintOid))
ConstraintSetParentConstraint(cldConstrId, constraintOid);
update_relpartitionparent(NULL, partIdxId, RelationGetRelid(parentIdx));
+ CommandCounterIncrement();
pfree(attmap);
@@ -15170,25 +15174,16 @@ ATExecAttachPartitionIdx(List **wqueue, Relation parentIdx, RangeVar *name)
static void
refuseDupeIndexAttach(Relation parentIdx, Relation partIdx, Relation partitionTbl)
{
- Relation pg_inherits;
- ScanKeyData key;
- HeapTuple tuple;
- SysScanDesc scan;
+ List *indexoids;
+ ListCell *lc;
- pg_inherits = heap_open(InheritsRelationId, AccessShareLock);
- ScanKeyInit(&key, Anum_pg_inherits_inhparent,
- BTEqualStrategyNumber, F_OIDEQ,
- ObjectIdGetDatum(RelationGetRelid(parentIdx)));
- scan = systable_beginscan(pg_inherits, InheritsParentIndexId, true,
- NULL, 1, &key);
- while (HeapTupleIsValid(tuple = systable_getnext(scan)))
+ indexoids = RelationGetIndexList(partitionTbl);
+
+ foreach(lc, indexoids)
{
- Form_pg_inherits inhForm;
- Oid tab;
+ Oid idxOid = lfirst_oid(lc);
- inhForm = (Form_pg_inherits) GETSTRUCT(tuple);
- tab = IndexGetRelation(inhForm->inhrelid, false);
- if (tab == RelationGetRelid(partitionTbl))
+ if (get_partition_parent(idxOid) == RelationGetRelid(parentIdx))
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("cannot attach index \"%s\" as a partition of index \"%s\"",
@@ -15197,9 +15192,6 @@ refuseDupeIndexAttach(Relation parentIdx, Relation partIdx, Relation partitionTb
errdetail("Another index is already attached for partition \"%s\".",
RelationGetRelationName(partitionTbl))));
}
-
- systable_endscan(scan);
- heap_close(pg_inherits, AccessShareLock);
}
/*
@@ -15211,72 +15203,82 @@ refuseDupeIndexAttach(Relation parentIdx, Relation partIdx, Relation partitionTb
static void
validatePartitionedIndex(Relation partedIdx, Relation partedTbl)
{
- Relation inheritsRel;
- SysScanDesc scan;
- ScanKeyData key;
- int tuples = 0;
- HeapTuple inhTup;
- bool updated = false;
+ PartitionDesc partdesc;
+ int i;
+ int nparts;
+ Oid partedidxoid = RelationGetRelid(partedIdx);
+ Relation idxRel;
+ HeapTuple newtup;
Assert(partedIdx->rd_rel->relkind == RELKIND_PARTITIONED_INDEX);
- /*
- * Scan pg_inherits for this parent index. Count each valid index we find
- * (verifying the pg_index entry for each), and if we reach the total
- * amount we expect, we can mark this parent index as valid.
- */
- inheritsRel = heap_open(InheritsRelationId, AccessShareLock);
- ScanKeyInit(&key, Anum_pg_inherits_inhparent,
- BTEqualStrategyNumber, F_OIDEQ,
- ObjectIdGetDatum(RelationGetRelid(partedIdx)));
- scan = systable_beginscan(inheritsRel, InheritsParentIndexId, true,
- NULL, 1, &key);
- while ((inhTup = systable_getnext(scan)) != NULL)
- {
- Form_pg_inherits inhForm = (Form_pg_inherits) GETSTRUCT(inhTup);
- HeapTuple indTup;
- Form_pg_index indexForm;
-
- indTup = SearchSysCache1(INDEXRELID,
- ObjectIdGetDatum(inhForm->inhrelid));
- if (!indTup)
- elog(ERROR, "cache lookup failed for index %u",
- inhForm->inhrelid);
- indexForm = (Form_pg_index) GETSTRUCT(indTup);
- if (IndexIsValid(indexForm))
- tuples += 1;
- ReleaseSysCache(indTup);
- }
-
- /* Done with pg_inherits */
- systable_endscan(scan);
- heap_close(inheritsRel, AccessShareLock);
+ partdesc = RelationGetPartitionDesc(partedTbl);
+ nparts = partdesc->nparts;
/*
- * If we found as many inherited indexes as the partitioned table has
- * partitions, we're good; update pg_index to set indisvalid.
+ * Check if all partitions have an index defined for this partitioned index.
+ * If they all have one then we can mark the partitioned index as valid.
*/
- if (tuples == RelationGetPartitionDesc(partedTbl)->nparts)
+ for (i = 0; i < nparts; i++)
{
- Relation idxRel;
- HeapTuple newtup;
+ Relation part = relation_open(partdesc->oids[i], AccessShareLock);
+ List *indexoids = RelationGetIndexList(part);
+ ListCell *lc;
+ bool found = false;
+
+ foreach(lc, indexoids)
+ {
+ Oid indexoid = lfirst_oid(lc);
+ HeapTuple tup;
+ Form_pg_class classForm;
+ Form_pg_index indexForm;
+
+ tup = SearchSysCache1(RELOID, ObjectIdGetDatum(indexoid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for relation %u", indexoid);
+ classForm = (Form_pg_class) GETSTRUCT(tup);
+ if (classForm->relpartitionparent != partedidxoid)
+ {
+ ReleaseSysCache(tup);
+ continue;
+ }
+ ReleaseSysCache(tup);
+
- idxRel = heap_open(IndexRelationId, RowExclusiveLock);
+ tup = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(indexoid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index %u", indexoid);
+ indexForm = (Form_pg_index) GETSTRUCT(tup);
- newtup = heap_copytuple(partedIdx->rd_indextuple);
- ((Form_pg_index) GETSTRUCT(newtup))->indisvalid = true;
- updated = true;
+ if (IndexIsValid(indexForm))
+ found = true;
- CatalogTupleUpdate(idxRel, &partedIdx->rd_indextuple->t_self, newtup);
+ ReleaseSysCache(tup);
+ break;
+ }
- heap_close(idxRel, RowExclusiveLock);
+ relation_close(part, AccessShareLock);
+
+ /* If the index was not found then we can't mark the index as valid */
+ if (!found)
+ return;
}
+ /* We're good; update pg_index to set indisvalid. */
+ idxRel = heap_open(IndexRelationId, RowExclusiveLock);
+
+ newtup = heap_copytuple(partedIdx->rd_indextuple);
+ ((Form_pg_index) GETSTRUCT(newtup))->indisvalid = true;
+
+ CatalogTupleUpdate(idxRel, &partedIdx->rd_indextuple->t_self, newtup);
+
+ heap_close(idxRel, RowExclusiveLock);
+
/*
* If this index is in turn a partition of a larger index, validating it
* might cause the parent to become valid also. Try that.
*/
- if (updated && OidIsValid(RelationGetParentRelid(partedIdx)))
+ if (OidIsValid(RelationGetParentRelid(partedIdx)))
{
Oid parentIdxId,
parentTblId;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index af7e2bd813..b616515157 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -6808,7 +6808,39 @@ getIndexes(Archive *fout, TableInfo tblinfo[], int numTables)
* is not.
*/
resetPQExpBuffer(query);
- if (fout->remoteVersion >= 110000)
+
+ if (fout->remoteVersion >= 120000)
+ {
+ appendPQExpBuffer(query,
+ "SELECT t.tableoid, t.oid, "
+ "t.relname AS indexname, "
+ "t.relpartitionparent AS parentidx, "
+ "pg_catalog.pg_get_indexdef(i.indexrelid) AS indexdef, "
+ "i.indnkeyatts AS indnkeyatts, "
+ "i.indnatts AS indnatts, "
+ "i.indkey, i.indisclustered, "
+ "i.indisreplident, t.relpages, "
+ "c.contype, c.conname, "
+ "c.condeferrable, c.condeferred, "
+ "c.tableoid AS contableoid, "
+ "c.oid AS conoid, "
+ "pg_catalog.pg_get_constraintdef(c.oid, false) AS condef, "
+ "(SELECT spcname FROM pg_catalog.pg_tablespace s WHERE s.oid = t.reltablespace) AS tablespace, "
+ "t.reloptions AS indreloptions "
+ "FROM pg_catalog.pg_index i "
+ "JOIN pg_catalog.pg_class t ON (t.oid = i.indexrelid) "
+ "JOIN pg_catalog.pg_class t2 ON (t2.oid = i.indrelid) "
+ "LEFT JOIN pg_catalog.pg_constraint c "
+ "ON (i.indrelid = c.conrelid AND "
+ "i.indexrelid = c.conindid AND "
+ "c.contype IN ('p','u','x')) "
+ "WHERE i.indrelid = '%u'::pg_catalog.oid "
+ "AND (i.indisvalid OR t2.relkind = 'p') "
+ "AND i.indisready "
+ "ORDER BY indexname",
+ tbinfo->dobj.catId.oid);
+ }
+ else if (fout->remoteVersion >= 110000)
{
appendPQExpBuffer(query,
"SELECT t.tableoid, t.oid, "
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index b9297c98d2..77ce91e847 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -6,20 +6,19 @@ create table idxpart2 partition of idxpart for values from (10) to (100)
partition by range (b);
create table idxpart21 partition of idxpart2 for values from (0) to (100);
create index on idxpart (a);
-select relname, relkind, inhparent::regclass
- from pg_class left join pg_index ix on (indexrelid = oid)
- left join pg_inherits on (ix.indexrelid = inhrelid)
+select relname, relkind, relpartitionparent::regclass
+ from pg_class
where relname like 'idxpart%' order by relname;
- relname | relkind | inhparent
------------------+---------+----------------
- idxpart | p |
- idxpart1 | r |
+ relname | relkind | relpartitionparent
+-----------------+---------+--------------------
+ idxpart | p | -
+ idxpart1 | r | idxpart
idxpart1_a_idx | i | idxpart_a_idx
- idxpart2 | p |
- idxpart21 | r |
+ idxpart2 | p | idxpart
+ idxpart21 | r | idxpart2
idxpart21_a_idx | i | idxpart2_a_idx
idxpart2_a_idx | I | idxpart_a_idx
- idxpart_a_idx | I |
+ idxpart_a_idx | I | -
(8 rows)
drop table idxpart;
@@ -91,16 +90,15 @@ Partition of: idxpart FOR VALUES FROM (0, 0) TO (10, 10)
Indexes:
"idxpart1_a_b_idx" btree (a, b)
-select relname, relkind, inhparent::regclass
- from pg_class left join pg_index ix on (indexrelid = oid)
- left join pg_inherits on (ix.indexrelid = inhrelid)
+select relname, relkind, relpartitionparent::regclass
+ from pg_class
where relname like 'idxpart%' order by relname;
- relname | relkind | inhparent
-------------------+---------+-----------------
- idxpart | p |
- idxpart1 | r |
+ relname | relkind | relpartitionparent
+------------------+---------+--------------------
+ idxpart | p | -
+ idxpart1 | r | idxpart
idxpart1_a_b_idx | i | idxpart_a_b_idx
- idxpart_a_b_idx | I |
+ idxpart_a_b_idx | I | -
(4 rows)
drop table idxpart;
@@ -238,29 +236,29 @@ Number of partitions: 2 (Use \d+ to list them.)
a | integer | | |
Partition of: idxpart2 FOR VALUES FROM (100) TO (200)
-select indexrelid::regclass, indrelid::regclass, inhparent::regclass
- from pg_index idx left join pg_inherits inh on (idx.indexrelid = inh.inhrelid)
+select indexrelid::regclass, indrelid::regclass, relpartitionparent::regclass
+ from pg_index idx left join pg_class c on (idx.indexrelid = c.oid)
where indexrelid::regclass::text like 'idxpart%'
order by indexrelid::regclass::text collate "C";
- indexrelid | indrelid | inhparent
------------------+-----------+---------------
+ indexrelid | indrelid | relpartitionparent
+-----------------+-----------+--------------------
idxpart1_a_idx | idxpart1 | idxpart_a_idx
- idxpart22_a_idx | idxpart22 |
+ idxpart22_a_idx | idxpart22 | -
idxpart2_a_idx | idxpart2 | idxpart_a_idx
- idxpart_a_idx | idxpart |
+ idxpart_a_idx | idxpart | -
(4 rows)
alter index idxpart2_a_idx attach partition idxpart22_a_idx;
-select indexrelid::regclass, indrelid::regclass, inhparent::regclass
- from pg_index idx left join pg_inherits inh on (idx.indexrelid = inh.inhrelid)
+select indexrelid::regclass, indrelid::regclass, relpartitionparent::regclass
+ from pg_index idx left join pg_class c on (idx.indexrelid = c.oid)
where indexrelid::regclass::text like 'idxpart%'
order by indexrelid::regclass::text collate "C";
- indexrelid | indrelid | inhparent
------------------+-----------+----------------
+ indexrelid | indrelid | relpartitionparent
+-----------------+-----------+--------------------
idxpart1_a_idx | idxpart1 | idxpart_a_idx
idxpart22_a_idx | idxpart22 | idxpart2_a_idx
idxpart2_a_idx | idxpart2 | idxpart_a_idx
- idxpart_a_idx | idxpart |
+ idxpart_a_idx | idxpart | -
(4 rows)
-- attaching idxpart22 is not enough to set idxpart22_a_idx valid ...
@@ -309,18 +307,17 @@ Indexes:
"idxpart1_a_idx" btree (a)
"idxpart1_b_c_idx" btree (b, c)
-select relname, relkind, inhparent::regclass
- from pg_class left join pg_index ix on (indexrelid = oid)
- left join pg_inherits on (ix.indexrelid = inhrelid)
+select relname, relkind, relpartitionparent::regclass
+ from pg_class
where relname like 'idxpart%' order by relname;
- relname | relkind | inhparent
-------------------+---------+-----------
- idxpart | p |
- idxpart1 | r |
- idxpart1_a_idx | i |
- idxpart1_b_c_idx | i |
- idxparti | I |
- idxparti2 | I |
+ relname | relkind | relpartitionparent
+------------------+---------+--------------------
+ idxpart | p | -
+ idxpart1 | r | -
+ idxpart1_a_idx | i | -
+ idxpart1_b_c_idx | i | -
+ idxparti | I | -
+ idxparti2 | I | -
(6 rows)
alter table idxpart attach partition idxpart1 for values from (0) to (10);
@@ -336,18 +333,17 @@ Indexes:
"idxpart1_a_idx" btree (a)
"idxpart1_b_c_idx" btree (b, c)
-select relname, relkind, inhparent::regclass
- from pg_class left join pg_index ix on (indexrelid = oid)
- left join pg_inherits on (ix.indexrelid = inhrelid)
+select relname, relkind, relpartitionparent::regclass
+ from pg_class
where relname like 'idxpart%' order by relname;
- relname | relkind | inhparent
-------------------+---------+-----------
- idxpart | p |
- idxpart1 | r |
+ relname | relkind | relpartitionparent
+------------------+---------+--------------------
+ idxpart | p | -
+ idxpart1 | r | idxpart
idxpart1_a_idx | i | idxparti
idxpart1_b_c_idx | i | idxparti2
- idxparti | I |
- idxparti2 | I |
+ idxparti | I | -
+ idxparti2 | I | -
(6 rows)
drop table idxpart;
@@ -482,7 +478,7 @@ select relname, relkind from pg_class where relname like 'idxpart%' order by rel
---------+---------
(0 rows)
--- Verify that expression indexes inherit correctly
+-- Verify that expression indexes have their parents set correctly
create table idxpart (a int, b int) partition by range (a);
create table idxpart1 (like idxpart);
create index on idxpart1 ((a + b));
@@ -491,10 +487,11 @@ create table idxpart2 (like idxpart);
alter table idxpart attach partition idxpart1 for values from (0000) to (1000);
alter table idxpart attach partition idxpart2 for values from (1000) to (2000);
create table idxpart3 partition of idxpart for values from (2000) to (3000);
-select relname as child, inhparent::regclass as parent, pg_get_indexdef as childdef
- from pg_class join pg_inherits on inhrelid = oid,
+select relname as child, relpartitionparent::regclass as parent, pg_get_indexdef as childdef
+ from pg_class,
lateral pg_get_indexdef(pg_class.oid)
- where relkind in ('i', 'I') and relname like 'idxpart%' order by relname;
+ where relpartitionparent <> 0 and relkind in ('i', 'I') and relname like 'idxpart%'
+ order by relname;
child | parent | childdef
-------------------+------------------+---------------------------------------------------------------------------
idxpart1_expr_idx | idxpart_expr_idx | CREATE INDEX idxpart1_expr_idx ON public.idxpart1 USING btree (((a + b)))
@@ -515,19 +512,19 @@ alter table idxpart attach partition idxpart2 for values from ('bbb') to ('ccc')
create table idxpart3 partition of idxpart for values from ('ccc') to ('ddd');
create index on idxpart (a collate "C");
create table idxpart4 partition of idxpart for values from ('ddd') to ('eee');
-select relname as child, inhparent::regclass as parent, pg_get_indexdef as childdef
- from pg_class left join pg_inherits on inhrelid = oid,
+select relname as child, relpartitionparent::regclass as parent, pg_get_indexdef as childdef
+ from pg_class,
lateral pg_get_indexdef(pg_class.oid)
where relkind in ('i', 'I') and relname like 'idxpart%' order by relname;
child | parent | childdef
-----------------+---------------+--------------------------------------------------------------------------------
idxpart1_a_idx | idxpart_a_idx | CREATE INDEX idxpart1_a_idx ON public.idxpart1 USING btree (a COLLATE "C")
- idxpart2_a_idx | | CREATE INDEX idxpart2_a_idx ON public.idxpart2 USING btree (a COLLATE "POSIX")
- idxpart2_a_idx1 | | CREATE INDEX idxpart2_a_idx1 ON public.idxpart2 USING btree (a)
+ idxpart2_a_idx | - | CREATE INDEX idxpart2_a_idx ON public.idxpart2 USING btree (a COLLATE "POSIX")
+ idxpart2_a_idx1 | - | CREATE INDEX idxpart2_a_idx1 ON public.idxpart2 USING btree (a)
idxpart2_a_idx2 | idxpart_a_idx | CREATE INDEX idxpart2_a_idx2 ON public.idxpart2 USING btree (a COLLATE "C")
idxpart3_a_idx | idxpart_a_idx | CREATE INDEX idxpart3_a_idx ON public.idxpart3 USING btree (a COLLATE "C")
idxpart4_a_idx | idxpart_a_idx | CREATE INDEX idxpart4_a_idx ON public.idxpart4 USING btree (a COLLATE "C")
- idxpart_a_idx | | CREATE INDEX idxpart_a_idx ON ONLY public.idxpart USING btree (a COLLATE "C")
+ idxpart_a_idx | - | CREATE INDEX idxpart_a_idx ON ONLY public.idxpart USING btree (a COLLATE "C")
(7 rows)
drop table idxpart;
@@ -542,18 +539,18 @@ create table idxpart3 partition of idxpart for values from ('ccc') to ('ddd');
create index on idxpart (a text_pattern_ops);
create table idxpart4 partition of idxpart for values from ('ddd') to ('eee');
-- must *not* have attached the index we created on idxpart2
-select relname as child, inhparent::regclass as parent, pg_get_indexdef as childdef
- from pg_class left join pg_inherits on inhrelid = oid,
+select relname as child, relpartitionparent::regclass as parent, pg_get_indexdef as childdef
+ from pg_class,
lateral pg_get_indexdef(pg_class.oid)
where relkind in ('i', 'I') and relname like 'idxpart%' order by relname;
child | parent | childdef
-----------------+---------------+------------------------------------------------------------------------------------
idxpart1_a_idx | idxpart_a_idx | CREATE INDEX idxpart1_a_idx ON public.idxpart1 USING btree (a text_pattern_ops)
- idxpart2_a_idx | | CREATE INDEX idxpart2_a_idx ON public.idxpart2 USING btree (a)
+ idxpart2_a_idx | - | CREATE INDEX idxpart2_a_idx ON public.idxpart2 USING btree (a)
idxpart2_a_idx1 | idxpart_a_idx | CREATE INDEX idxpart2_a_idx1 ON public.idxpart2 USING btree (a text_pattern_ops)
idxpart3_a_idx | idxpart_a_idx | CREATE INDEX idxpart3_a_idx ON public.idxpart3 USING btree (a text_pattern_ops)
idxpart4_a_idx | idxpart_a_idx | CREATE INDEX idxpart4_a_idx ON public.idxpart4 USING btree (a text_pattern_ops)
- idxpart_a_idx | | CREATE INDEX idxpart_a_idx ON ONLY public.idxpart USING btree (a text_pattern_ops)
+ idxpart_a_idx | - | CREATE INDEX idxpart_a_idx ON ONLY public.idxpart USING btree (a text_pattern_ops)
(6 rows)
drop index idxpart_a_idx;
@@ -588,19 +585,19 @@ alter index idxpart_2_idx attach partition idxpart1_2c_idx; -- fail
ERROR: cannot attach index "idxpart1_2c_idx" as a partition of index "idxpart_2_idx"
DETAIL: The index definitions do not match.
alter index idxpart_2_idx attach partition idxpart1_2_idx; -- ok
-select relname as child, inhparent::regclass as parent, pg_get_indexdef as childdef
- from pg_class left join pg_inherits on inhrelid = oid,
+select relname as child, relpartitionparent::regclass as parent, pg_get_indexdef as childdef
+ from pg_class,
lateral pg_get_indexdef(pg_class.oid)
where relkind in ('i', 'I') and relname like 'idxpart%' order by relname;
child | parent | childdef
-----------------+---------------+-----------------------------------------------------------------------------------------
idxpart1_1_idx | idxpart_1_idx | CREATE INDEX idxpart1_1_idx ON public.idxpart1 USING btree (b, a)
- idxpart1_1b_idx | | CREATE INDEX idxpart1_1b_idx ON public.idxpart1 USING btree (b)
+ idxpart1_1b_idx | - | CREATE INDEX idxpart1_1b_idx ON public.idxpart1 USING btree (b)
idxpart1_2_idx | idxpart_2_idx | CREATE INDEX idxpart1_2_idx ON public.idxpart1 USING btree (((b + a))) WHERE (a > 1)
- idxpart1_2b_idx | | CREATE INDEX idxpart1_2b_idx ON public.idxpart1 USING btree (((a + b))) WHERE (a > 1)
- idxpart1_2c_idx | | CREATE INDEX idxpart1_2c_idx ON public.idxpart1 USING btree (((b + a))) WHERE (b > 1)
- idxpart_1_idx | | CREATE INDEX idxpart_1_idx ON ONLY public.idxpart USING btree (b, a)
- idxpart_2_idx | | CREATE INDEX idxpart_2_idx ON ONLY public.idxpart USING btree (((b + a))) WHERE (a > 1)
+ idxpart1_2b_idx | - | CREATE INDEX idxpart1_2b_idx ON public.idxpart1 USING btree (((a + b))) WHERE (a > 1)
+ idxpart1_2c_idx | - | CREATE INDEX idxpart1_2c_idx ON public.idxpart1 USING btree (((b + a))) WHERE (b > 1)
+ idxpart_1_idx | - | CREATE INDEX idxpart_1_idx ON ONLY public.idxpart USING btree (b, a)
+ idxpart_2_idx | - | CREATE INDEX idxpart_2_idx ON ONLY public.idxpart USING btree (((b + a))) WHERE (a > 1)
(7 rows)
drop table idxpart;
@@ -929,17 +926,17 @@ create table idxpart0 partition of idxpart (i) for values with (modulus 2, remai
create table idxpart1 partition of idxpart (i) for values with (modulus 2, remainder 1);
alter table idxpart0 add primary key(i);
alter table idxpart add primary key(i);
-select indrelid::regclass, indexrelid::regclass, inhparent::regclass, indisvalid,
+select indrelid::regclass, indexrelid::regclass, relpartitionparent::regclass, indisvalid,
conname, conislocal, coninhcount, connoinherit, convalidated
- from pg_index idx left join pg_inherits inh on (idx.indexrelid = inh.inhrelid)
+ from pg_index idx left join pg_class c on (idx.indexrelid = c.oid)
left join pg_constraint con on (idx.indexrelid = con.conindid)
where indrelid::regclass::text like 'idxpart%'
order by indexrelid::regclass::text collate "C";
- indrelid | indexrelid | inhparent | indisvalid | conname | conislocal | coninhcount | connoinherit | convalidated
-----------+---------------+--------------+------------+---------------+------------+-------------+--------------+--------------
- idxpart0 | idxpart0_pkey | idxpart_pkey | t | idxpart0_pkey | f | 1 | t | t
- idxpart1 | idxpart1_pkey | idxpart_pkey | t | idxpart1_pkey | f | 1 | f | t
- idxpart | idxpart_pkey | | t | idxpart_pkey | t | 0 | t | t
+ indrelid | indexrelid | relpartitionparent | indisvalid | conname | conislocal | coninhcount | connoinherit | convalidated
+----------+---------------+--------------------+------------+---------------+------------+-------------+--------------+--------------
+ idxpart0 | idxpart0_pkey | idxpart_pkey | t | idxpart0_pkey | f | 1 | t | t
+ idxpart1 | idxpart1_pkey | idxpart_pkey | t | idxpart1_pkey | f | 1 | f | t
+ idxpart | idxpart_pkey | - | t | idxpart_pkey | t | 0 | t | t
(3 rows)
drop index idxpart0_pkey; -- fail
@@ -953,14 +950,14 @@ ERROR: cannot drop inherited constraint "idxpart0_pkey" of relation "idxpart0"
alter table idxpart1 drop constraint idxpart1_pkey; -- fail
ERROR: cannot drop inherited constraint "idxpart1_pkey" of relation "idxpart1"
alter table idxpart drop constraint idxpart_pkey; -- ok
-select indrelid::regclass, indexrelid::regclass, inhparent::regclass, indisvalid,
+select indrelid::regclass, indexrelid::regclass, relpartitionparent::regclass, indisvalid,
conname, conislocal, coninhcount, connoinherit, convalidated
- from pg_index idx left join pg_inherits inh on (idx.indexrelid = inh.inhrelid)
+ from pg_index idx left join pg_class c on (idx.indexrelid = c.oid)
left join pg_constraint con on (idx.indexrelid = con.conindid)
where indrelid::regclass::text like 'idxpart%'
order by indexrelid::regclass::text collate "C";
- indrelid | indexrelid | inhparent | indisvalid | conname | conislocal | coninhcount | connoinherit | convalidated
-----------+------------+-----------+------------+---------+------------+-------------+--------------+--------------
+ indrelid | indexrelid | relpartitionparent | indisvalid | conname | conislocal | coninhcount | connoinherit | convalidated
+----------+------------+--------------------+------------+---------+------------+-------------+--------------+--------------
(0 rows)
drop table idxpart;
@@ -987,29 +984,29 @@ create table idxpart0 (like idxpart);
alter table idxpart0 add primary key (a);
alter table idxpart attach partition idxpart0 for values from (0) to (1000);
alter table only idxpart add primary key (a);
-select indrelid::regclass, indexrelid::regclass, inhparent::regclass, indisvalid,
+select indrelid::regclass, indexrelid::regclass, relpartitionparent::regclass, indisvalid,
conname, conislocal, coninhcount, connoinherit, convalidated
- from pg_index idx left join pg_inherits inh on (idx.indexrelid = inh.inhrelid)
+ from pg_index idx left join pg_class c on (idx.indexrelid = c.oid)
left join pg_constraint con on (idx.indexrelid = con.conindid)
where indrelid::regclass::text like 'idxpart%'
order by indexrelid::regclass::text collate "C";
- indrelid | indexrelid | inhparent | indisvalid | conname | conislocal | coninhcount | connoinherit | convalidated
-----------+---------------+-----------+------------+---------------+------------+-------------+--------------+--------------
- idxpart0 | idxpart0_pkey | | t | idxpart0_pkey | t | 0 | t | t
- idxpart | idxpart_pkey | | f | idxpart_pkey | t | 0 | t | t
+ indrelid | indexrelid | relpartitionparent | indisvalid | conname | conislocal | coninhcount | connoinherit | convalidated
+----------+---------------+--------------------+------------+---------------+------------+-------------+--------------+--------------
+ idxpart0 | idxpart0_pkey | - | t | idxpart0_pkey | t | 0 | t | t
+ idxpart | idxpart_pkey | - | f | idxpart_pkey | t | 0 | t | t
(2 rows)
alter index idxpart_pkey attach partition idxpart0_pkey;
-select indrelid::regclass, indexrelid::regclass, inhparent::regclass, indisvalid,
+select indrelid::regclass, indexrelid::regclass, relpartitionparent::regclass, indisvalid,
conname, conislocal, coninhcount, connoinherit, convalidated
- from pg_index idx left join pg_inherits inh on (idx.indexrelid = inh.inhrelid)
+ from pg_index idx left join pg_class c on (idx.indexrelid = c.oid)
left join pg_constraint con on (idx.indexrelid = con.conindid)
where indrelid::regclass::text like 'idxpart%'
order by indexrelid::regclass::text collate "C";
- indrelid | indexrelid | inhparent | indisvalid | conname | conislocal | coninhcount | connoinherit | convalidated
-----------+---------------+--------------+------------+---------------+------------+-------------+--------------+--------------
- idxpart0 | idxpart0_pkey | idxpart_pkey | t | idxpart0_pkey | f | 1 | t | t
- idxpart | idxpart_pkey | | t | idxpart_pkey | t | 0 | t | t
+ indrelid | indexrelid | relpartitionparent | indisvalid | conname | conislocal | coninhcount | connoinherit | convalidated
+----------+---------------+--------------------+------------+---------------+------------+-------------+--------------+--------------
+ idxpart0 | idxpart0_pkey | idxpart_pkey | t | idxpart0_pkey | f | 1 | t | t
+ idxpart | idxpart_pkey | - | t | idxpart_pkey | t | 0 | t | t
(2 rows)
drop table idxpart;
@@ -1020,17 +1017,17 @@ create table idxpart1 (a int not null, b int);
create unique index on idxpart1 (a);
alter table idxpart add primary key (a);
alter table idxpart attach partition idxpart1 for values from (1) to (1000);
-select indrelid::regclass, indexrelid::regclass, inhparent::regclass, indisvalid,
+select indrelid::regclass, indexrelid::regclass, relpartitionparent::regclass, indisvalid,
conname, conislocal, coninhcount, connoinherit, convalidated
- from pg_index idx left join pg_inherits inh on (idx.indexrelid = inh.inhrelid)
+ from pg_index idx left join pg_class c on (idx.indexrelid = c.oid)
left join pg_constraint con on (idx.indexrelid = con.conindid)
where indrelid::regclass::text like 'idxpart%'
order by indexrelid::regclass::text collate "C";
- indrelid | indexrelid | inhparent | indisvalid | conname | conislocal | coninhcount | connoinherit | convalidated
-----------+----------------+--------------+------------+---------------+------------+-------------+--------------+--------------
- idxpart1 | idxpart1_a_idx | | t | | | | |
- idxpart1 | idxpart1_pkey | idxpart_pkey | t | idxpart1_pkey | f | 1 | f | t
- idxpart | idxpart_pkey | | t | idxpart_pkey | t | 0 | t | t
+ indrelid | indexrelid | relpartitionparent | indisvalid | conname | conislocal | coninhcount | connoinherit | convalidated
+----------+----------------+--------------------+------------+---------------+------------+-------------+--------------+--------------
+ idxpart1 | idxpart1_a_idx | - | t | | | | |
+ idxpart1 | idxpart1_pkey | idxpart_pkey | t | idxpart1_pkey | f | 1 | f | t
+ idxpart | idxpart_pkey | - | t | idxpart_pkey | t | 0 | t | t
(3 rows)
drop table idxpart;
diff --git a/src/test/regress/sql/indexing.sql b/src/test/regress/sql/indexing.sql
index 2091a87ff5..efc5a63d6b 100644
--- a/src/test/regress/sql/indexing.sql
+++ b/src/test/regress/sql/indexing.sql
@@ -6,9 +6,8 @@ create table idxpart2 partition of idxpart for values from (10) to (100)
partition by range (b);
create table idxpart21 partition of idxpart2 for values from (0) to (100);
create index on idxpart (a);
-select relname, relkind, inhparent::regclass
- from pg_class left join pg_index ix on (indexrelid = oid)
- left join pg_inherits on (ix.indexrelid = inhrelid)
+select relname, relkind, relpartitionparent::regclass
+ from pg_class
where relname like 'idxpart%' order by relname;
drop table idxpart;
@@ -52,9 +51,8 @@ create table idxpart1 partition of idxpart for values from (0, 0) to (10, 10);
create index on idxpart1 (a, b);
create index on idxpart (a, b);
\d idxpart1
-select relname, relkind, inhparent::regclass
- from pg_class left join pg_index ix on (indexrelid = oid)
- left join pg_inherits on (ix.indexrelid = inhrelid)
+select relname, relkind, relpartitionparent::regclass
+ from pg_class
where relname like 'idxpart%' order by relname;
drop table idxpart;
@@ -129,13 +127,13 @@ create index on idxpart (a);
\d idxpart1
\d idxpart2
\d idxpart21
-select indexrelid::regclass, indrelid::regclass, inhparent::regclass
- from pg_index idx left join pg_inherits inh on (idx.indexrelid = inh.inhrelid)
+select indexrelid::regclass, indrelid::regclass, relpartitionparent::regclass
+ from pg_index idx left join pg_class c on (idx.indexrelid = c.oid)
where indexrelid::regclass::text like 'idxpart%'
order by indexrelid::regclass::text collate "C";
alter index idxpart2_a_idx attach partition idxpart22_a_idx;
-select indexrelid::regclass, indrelid::regclass, inhparent::regclass
- from pg_index idx left join pg_inherits inh on (idx.indexrelid = inh.inhrelid)
+select indexrelid::regclass, indrelid::regclass, relpartitionparent::regclass
+ from pg_index idx left join pg_class c on (idx.indexrelid = c.oid)
where indexrelid::regclass::text like 'idxpart%'
order by indexrelid::regclass::text collate "C";
-- attaching idxpart22 is not enough to set idxpart22_a_idx valid ...
@@ -155,15 +153,13 @@ create index idxparti on idxpart (a);
create index idxparti2 on idxpart (b, c);
create table idxpart1 (like idxpart including indexes);
\d idxpart1
-select relname, relkind, inhparent::regclass
- from pg_class left join pg_index ix on (indexrelid = oid)
- left join pg_inherits on (ix.indexrelid = inhrelid)
+select relname, relkind, relpartitionparent::regclass
+ from pg_class
where relname like 'idxpart%' order by relname;
alter table idxpart attach partition idxpart1 for values from (0) to (10);
\d idxpart1
-select relname, relkind, inhparent::regclass
- from pg_class left join pg_index ix on (indexrelid = oid)
- left join pg_inherits on (ix.indexrelid = inhrelid)
+select relname, relkind, relpartitionparent::regclass
+ from pg_class
where relname like 'idxpart%' order by relname;
drop table idxpart;
@@ -230,7 +226,7 @@ select relname, relkind from pg_class where relname like 'idxpart%' order by rel
drop table idxpart, idxpart1, idxpart2, idxpart3;
select relname, relkind from pg_class where relname like 'idxpart%' order by relname;
--- Verify that expression indexes inherit correctly
+-- Verify that expression indexes have their parents set correctly
create table idxpart (a int, b int) partition by range (a);
create table idxpart1 (like idxpart);
create index on idxpart1 ((a + b));
@@ -239,10 +235,11 @@ create table idxpart2 (like idxpart);
alter table idxpart attach partition idxpart1 for values from (0000) to (1000);
alter table idxpart attach partition idxpart2 for values from (1000) to (2000);
create table idxpart3 partition of idxpart for values from (2000) to (3000);
-select relname as child, inhparent::regclass as parent, pg_get_indexdef as childdef
- from pg_class join pg_inherits on inhrelid = oid,
+select relname as child, relpartitionparent::regclass as parent, pg_get_indexdef as childdef
+ from pg_class,
lateral pg_get_indexdef(pg_class.oid)
- where relkind in ('i', 'I') and relname like 'idxpart%' order by relname;
+ where relpartitionparent <> 0 and relkind in ('i', 'I') and relname like 'idxpart%'
+ order by relname;
drop table idxpart;
-- Verify behavior for collation (mis)matches
@@ -257,8 +254,8 @@ alter table idxpart attach partition idxpart2 for values from ('bbb') to ('ccc')
create table idxpart3 partition of idxpart for values from ('ccc') to ('ddd');
create index on idxpart (a collate "C");
create table idxpart4 partition of idxpart for values from ('ddd') to ('eee');
-select relname as child, inhparent::regclass as parent, pg_get_indexdef as childdef
- from pg_class left join pg_inherits on inhrelid = oid,
+select relname as child, relpartitionparent::regclass as parent, pg_get_indexdef as childdef
+ from pg_class,
lateral pg_get_indexdef(pg_class.oid)
where relkind in ('i', 'I') and relname like 'idxpart%' order by relname;
drop table idxpart;
@@ -274,8 +271,8 @@ create table idxpart3 partition of idxpart for values from ('ccc') to ('ddd');
create index on idxpart (a text_pattern_ops);
create table idxpart4 partition of idxpart for values from ('ddd') to ('eee');
-- must *not* have attached the index we created on idxpart2
-select relname as child, inhparent::regclass as parent, pg_get_indexdef as childdef
- from pg_class left join pg_inherits on inhrelid = oid,
+select relname as child, relpartitionparent::regclass as parent, pg_get_indexdef as childdef
+ from pg_class,
lateral pg_get_indexdef(pg_class.oid)
where relkind in ('i', 'I') and relname like 'idxpart%' order by relname;
drop index idxpart_a_idx;
@@ -303,8 +300,8 @@ alter index idxpart_1_idx attach partition idxpart1_1_idx;
alter index idxpart_2_idx attach partition idxpart1_2b_idx; -- fail
alter index idxpart_2_idx attach partition idxpart1_2c_idx; -- fail
alter index idxpart_2_idx attach partition idxpart1_2_idx; -- ok
-select relname as child, inhparent::regclass as parent, pg_get_indexdef as childdef
- from pg_class left join pg_inherits on inhrelid = oid,
+select relname as child, relpartitionparent::regclass as parent, pg_get_indexdef as childdef
+ from pg_class,
lateral pg_get_indexdef(pg_class.oid)
where relkind in ('i', 'I') and relname like 'idxpart%' order by relname;
drop table idxpart;
@@ -485,9 +482,9 @@ create table idxpart0 partition of idxpart (i) for values with (modulus 2, remai
create table idxpart1 partition of idxpart (i) for values with (modulus 2, remainder 1);
alter table idxpart0 add primary key(i);
alter table idxpart add primary key(i);
-select indrelid::regclass, indexrelid::regclass, inhparent::regclass, indisvalid,
+select indrelid::regclass, indexrelid::regclass, relpartitionparent::regclass, indisvalid,
conname, conislocal, coninhcount, connoinherit, convalidated
- from pg_index idx left join pg_inherits inh on (idx.indexrelid = inh.inhrelid)
+ from pg_index idx left join pg_class c on (idx.indexrelid = c.oid)
left join pg_constraint con on (idx.indexrelid = con.conindid)
where indrelid::regclass::text like 'idxpart%'
order by indexrelid::regclass::text collate "C";
@@ -496,9 +493,9 @@ drop index idxpart1_pkey; -- fail
alter table idxpart0 drop constraint idxpart0_pkey; -- fail
alter table idxpart1 drop constraint idxpart1_pkey; -- fail
alter table idxpart drop constraint idxpart_pkey; -- ok
-select indrelid::regclass, indexrelid::regclass, inhparent::regclass, indisvalid,
+select indrelid::regclass, indexrelid::regclass, relpartitionparent::regclass, indisvalid,
conname, conislocal, coninhcount, connoinherit, convalidated
- from pg_index idx left join pg_inherits inh on (idx.indexrelid = inh.inhrelid)
+ from pg_index idx left join pg_class c on (idx.indexrelid = c.oid)
left join pg_constraint con on (idx.indexrelid = con.conindid)
where indrelid::regclass::text like 'idxpart%'
order by indexrelid::regclass::text collate "C";
@@ -527,16 +524,16 @@ create table idxpart0 (like idxpart);
alter table idxpart0 add primary key (a);
alter table idxpart attach partition idxpart0 for values from (0) to (1000);
alter table only idxpart add primary key (a);
-select indrelid::regclass, indexrelid::regclass, inhparent::regclass, indisvalid,
+select indrelid::regclass, indexrelid::regclass, relpartitionparent::regclass, indisvalid,
conname, conislocal, coninhcount, connoinherit, convalidated
- from pg_index idx left join pg_inherits inh on (idx.indexrelid = inh.inhrelid)
+ from pg_index idx left join pg_class c on (idx.indexrelid = c.oid)
left join pg_constraint con on (idx.indexrelid = con.conindid)
where indrelid::regclass::text like 'idxpart%'
order by indexrelid::regclass::text collate "C";
alter index idxpart_pkey attach partition idxpart0_pkey;
-select indrelid::regclass, indexrelid::regclass, inhparent::regclass, indisvalid,
+select indrelid::regclass, indexrelid::regclass, relpartitionparent::regclass, indisvalid,
conname, conislocal, coninhcount, connoinherit, convalidated
- from pg_index idx left join pg_inherits inh on (idx.indexrelid = inh.inhrelid)
+ from pg_index idx left join pg_class c on (idx.indexrelid = c.oid)
left join pg_constraint con on (idx.indexrelid = con.conindid)
where indrelid::regclass::text like 'idxpart%'
order by indexrelid::regclass::text collate "C";
@@ -549,9 +546,9 @@ create table idxpart1 (a int not null, b int);
create unique index on idxpart1 (a);
alter table idxpart add primary key (a);
alter table idxpart attach partition idxpart1 for values from (1) to (1000);
-select indrelid::regclass, indexrelid::regclass, inhparent::regclass, indisvalid,
+select indrelid::regclass, indexrelid::regclass, relpartitionparent::regclass, indisvalid,
conname, conislocal, coninhcount, connoinherit, convalidated
- from pg_index idx left join pg_inherits inh on (idx.indexrelid = inh.inhrelid)
+ from pg_index idx left join pg_class c on (idx.indexrelid = c.oid)
left join pg_constraint con on (idx.indexrelid = con.conindid)
where indrelid::regclass::text like 'idxpart%'
order by indexrelid::regclass::text collate "C";
--
2.16.2.windows.1
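Before the 0004 patch itself, a quick sketch of how the new syntax is meant to be used, going by the grammar changes and the function comments in that patch ("ALTER TABLE <name> ATTACH PARTITION [CONCURRENTLY] <partition-name> FOR VALUES"). The table names here are made up for illustration, and the keyword placement may of course still change during review:

```sql
-- Plain attach: takes an AccessExclusiveLock on the partitioned table,
-- blocking concurrent queries against it.
alter table parted attach partition part1 for values from (0) to (1000);

-- Proposed concurrent attach: only a ShareUpdateExclusiveLock on parted.
-- The pg_partition row is inserted with partvalid = false, the transaction
-- is committed, and partvalid is flipped to true once no snapshot older
-- than ours remains (the partition being attached still gets an
-- AccessExclusiveLock).
alter table parted attach partition concurrently part2
  for values from (1000) to (2000);

-- Per the patch, this errors out while a default partition exists, since
-- rows may need to be moved out of it under an AccessExclusiveLock.
```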
Attachment: v1-0004-Allow-partitions-to-be-attached-without-blocking-.patch (application/octet-stream)
From 01bc09bb547b992c9781df15b1ea9ff0e073b349 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Thu, 2 Aug 2018 16:07:08 +1200
Subject: [PATCH v1 4/4] Allow partitions to be attached without blocking
queries
---
src/backend/commands/tablecmds.c | 231 +++++++++++++++++++++++++++++----
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/equalfuncs.c | 1 +
src/backend/optimizer/plan/planner.c | 4 +
src/backend/optimizer/prep/prepunion.c | 33 +++--
src/backend/optimizer/util/plancat.c | 3 +
src/backend/optimizer/util/relnode.c | 19 +--
src/backend/parser/gram.y | 16 ++-
src/backend/partitioning/partprune.c | 6 +-
src/backend/utils/cache/partcache.c | 24 +++-
src/backend/utils/cache/relcache.c | 6 +-
src/bin/psql/describe.c | 21 ++-
src/include/catalog/partition.h | 7 +
src/include/nodes/parsenodes.h | 2 +
src/include/nodes/relation.h | 11 ++
15 files changed, 318 insertions(+), 67 deletions(-)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index e01ca8211a..d3db294665 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -87,6 +87,7 @@
#include "storage/lmgr.h"
#include "storage/lock.h"
#include "storage/predicate.h"
+#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/acl.h"
#include "utils/builtins.h"
@@ -344,7 +345,7 @@ static void ATController(AlterTableStmt *parsetree,
static void ATPrepCmd(List **wqueue, Relation rel, AlterTableCmd *cmd,
bool recurse, bool recursing, LOCKMODE lockmode);
static void ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode);
-static void ATExecCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
+static void ATExecCmd(List **wqueue, AlteredTableInfo *tab, Relation *relp,
AlterTableCmd *cmd, LOCKMODE lockmode);
static void ATRewriteTables(AlterTableStmt *parsetree,
List **wqueue, LOCKMODE lockmode);
@@ -479,19 +480,19 @@ static PartitionSpec *transformPartitionSpec(Relation rel, PartitionSpec *partsp
static void ComputePartitionAttrs(Relation rel, List *partParams, AttrNumber *partattrs,
List **partexprs, Oid *partopclass, Oid *partcollation, char strategy);
static void AttachPartition(Relation attachrel, Relation rel,
- PartitionBoundSpec *bound);
+ PartitionBoundSpec *bound, bool valid);
static void CreateInheritance(Relation child_rel, Relation parent_rel);
static void RemoveInheritance(Relation child_rel, Relation parent_rel);
-static ObjectAddress ATExecAttachPartition(List **wqueue, Relation rel,
+static ObjectAddress ATExecAttachPartition(List **wqueue, Relation *relp,
PartitionCmd *cmd);
static void AttachPartitionEnsureIndexes(Relation rel, Relation attachrel);
static void QueuePartitionConstraintValidation(List **wqueue, Relation scanrel,
List *partConstraint,
bool validate_default);
static void CloneRowTriggersToPartition(Relation parent, Relation partition);
-static ObjectAddress ATExecDetachPartition(Relation rel, RangeVar *name);
+static ObjectAddress ATExecDetachPartition(Relation rel, PartitionCmd *cmd);
static ObjectAddress ATExecAttachPartitionIdx(List **wqueue, Relation rel,
- RangeVar *name);
+ PartitionCmd *cmd);
static void validatePartitionedIndex(Relation partedIdx, Relation partedTbl);
static void refuseDupeIndexAttach(Relation parentIdx, Relation partIdx,
Relation partitionTbl);
@@ -865,7 +866,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
}
/* Add the pg_partition record */
- AttachPartition(rel, parent, bound);
+ AttachPartition(rel, parent, bound, true);
/* Update the pg_class entry. */
MarkRelationPartitioned(rel, parent, bound->is_default);
@@ -3619,7 +3620,11 @@ AlterTableGetLockLevel(List *cmds)
case AT_AttachPartition:
case AT_DetachPartition:
- cmd_lockmode = AccessExclusiveLock;
+ /* CONCURRENTLY option does not use an AccessExclusiveLock */
+ if (IsA(cmd->def, PartitionCmd) && ((PartitionCmd *) cmd->def)->concurrently)
+ cmd_lockmode = ShareUpdateExclusiveLock;
+ else
+ cmd_lockmode = AccessExclusiveLock;
break;
default: /* oops */
@@ -4012,8 +4017,14 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode)
*/
rel = relation_open(tab->relid, NoLock);
+ /*
+	 * We must pass a pointer to rel as some subcommands such as
+ * ATTACH PARTITION CONCURRENTLY commit the transaction and start
+ * a new one, meaning that rel must be closed and reopened.
+ * Without this we'd end up with a pointer to the closed copy.
+ */
foreach(lcmd, subcmds)
- ATExecCmd(wqueue, tab, rel,
+ ATExecCmd(wqueue, tab, &rel,
castNode(AlterTableCmd, lfirst(lcmd)),
lockmode);
@@ -4051,10 +4062,11 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode)
* ATExecCmd: dispatch a subcommand to appropriate execution routine
*/
static void
-ATExecCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
+ATExecCmd(List **wqueue, AlteredTableInfo *tab, Relation *relp,
AlterTableCmd *cmd, LOCKMODE lockmode)
{
ObjectAddress address = InvalidObjectAddress;
+ Relation rel = *relp;
switch (cmd->subtype)
{
@@ -4306,15 +4318,15 @@ ATExecCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
break;
case AT_AttachPartition:
if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
- ATExecAttachPartition(wqueue, rel, (PartitionCmd *) cmd->def);
+ ATExecAttachPartition(wqueue, relp, (PartitionCmd *) cmd->def);
else
ATExecAttachPartitionIdx(wqueue, rel,
- ((PartitionCmd *) cmd->def)->name);
+ (PartitionCmd *) cmd->def);
break;
case AT_DetachPartition:
/* ATPrepCmd ensures it must be a table */
Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
- ATExecDetachPartition(rel, ((PartitionCmd *) cmd->def)->name);
+ ATExecDetachPartition(rel, (PartitionCmd *) cmd->def);
break;
default: /* oops */
elog(ERROR, "unrecognized alter table type: %d",
@@ -11606,7 +11618,8 @@ ATExecAddInherit(Relation child_rel, RangeVar *parent, LOCKMODE lockmode)
}
static void
-AttachPartition(Relation attachrel, Relation rel, PartitionBoundSpec *bound)
+AttachPartition(Relation attachrel, Relation rel, PartitionBoundSpec *bound,
+ bool valid)
{
Datum values[Natts_pg_partition];
bool nulls[Natts_pg_partition];
@@ -11624,6 +11637,7 @@ AttachPartition(Relation attachrel, Relation rel, PartitionBoundSpec *bound)
*/
values[Anum_pg_partition_partrelid - 1] = ObjectIdGetDatum(attachrelid);
values[Anum_pg_partition_parentrelid - 1] = ObjectIdGetDatum(partedrelid);
+ values[Anum_pg_partition_partvalid - 1] = BoolGetDatum(valid);
values[Anum_pg_partition_partbound - 1] = CStringGetTextDatum(nodeToString(bound));
memset(nulls, 0, sizeof(nulls));
@@ -14170,15 +14184,16 @@ QueuePartitionConstraintValidation(List **wqueue, Relation scanrel,
}
/*
- * ALTER TABLE <name> ATTACH PARTITION <partition-name> FOR VALUES
+ * ALTER TABLE <name> ATTACH PARTITION [CONCURRENTLY] <partition-name> FOR VALUES
*
* Return the address of the newly attached partition.
*/
static ObjectAddress
-ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
+ATExecAttachPartition(List **wqueue, Relation *relp, PartitionCmd *cmd)
{
Relation attachrel,
- catalog;
+ catalog,
+ rel;
List *partConstraint;
SysScanDesc scan;
ScanKeyData skey;
@@ -14192,6 +14207,10 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
List *partBoundConstraint;
List *cloned;
ListCell *l;
+ LOCKMODE lockmode;
+
+ lockmode = cmd->concurrently ? ShareUpdateExclusiveLock : AccessExclusiveLock;
+ rel = *relp;
/*
* We must lock the default partition if one exists, because attaching a
@@ -14200,8 +14219,28 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
defaultPartOid =
get_default_oid_from_partdesc(RelationGetPartitionDesc(rel));
if (OidIsValid(defaultPartOid))
+ {
+ /*
+ * When attaching a partition to a partitioned table which has a
+ * default partition, the default partition must be locked with an
+ * AccessExclusiveLock so that tuples which are in the default
+ * partition which should now belong to the newly attached partition
+ * can be moved. Moving these tuples while there is concurrent
+ * activity on the table is difficult to do transparently, so for now
+ * we'll just disallow the CONCURRENTLY option when there is a default
+ * partition.
+ */
+ if (cmd->concurrently)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("concurrent attach to a partitioned table with a default partition is unsupported")));
+ }
+
LockRelationOid(defaultPartOid, AccessExclusiveLock);
+ }
+ /* Always take an AccessExclusiveLock on the relation being attached */
attachrel = heap_openrv(cmd->name, AccessExclusiveLock);
/*
@@ -14275,7 +14314,7 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
List *attachrel_children;
attachrel_children = get_partition_descendants(RelationGetRelid(attachrel),
- AccessExclusiveLock);
+ lockmode);
if (list_member_oid(attachrel_children, RelationGetRelid(rel)))
ereport(ERROR,
(errcode(ERRCODE_DUPLICATE_TABLE),
@@ -14382,7 +14421,11 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
check_new_partition_bound(RelationGetRelationName(attachrel), rel,
cmd->bound);
- AttachPartition(attachrel, rel, cmd->bound);
+ /*
+ * When the CONCURRENTLY option was not specified we mark the partition as
+ * valid right away.
+ */
+ AttachPartition(attachrel, rel, cmd->bound, !cmd->concurrently);
/* Update the pg_class entry. */
MarkRelationPartitioned(attachrel, rel, cmd->bound->is_default);
@@ -14489,6 +14532,138 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
ObjectAddressSet(address, RelationRelationId, RelationGetRelid(attachrel));
+
+ if (cmd->concurrently)
+ {
+ LOCKTAG attachlocktag;
+ LockRelId attachrelid;
+ LOCKTAG partedlocktag;
+ LockRelId partedrelid;
+ Relation partRelation;
+ HeapTuple tuple;
+ Oid attachoid = RelationGetRelid(attachrel);
+ Oid relid = RelationGetRelid(rel);
+ TransactionId limitXmin;
+ Snapshot snapshot;
+ VirtualTransactionId *old_snapshots;
+ int n_old_snapshots;
+ int i;
+
+ /*
+ * To allow the CONCURRENT ATTACH operation we need to make this
+ * partition visible to other transactions. To do that we must commit
+ * this transaction. In order to prevent another transaction dropping
+ * or detaching this newly attached partition we must obtain a session
+ * level lock on it. We must also maintain a ShareUpdateExclusiveLock
+ * on the partitioned table to prevent other sessions attaching any
+ * other partitions. XXX is that needed?
+ */
+ attachrelid = attachrel->rd_lockInfo.lockRelId;
+ SET_LOCKTAG_RELATION(attachlocktag, attachrelid.dbId, attachrelid.relId);
+ heap_close(attachrel, NoLock);
+
+ partedrelid = rel->rd_lockInfo.lockRelId;
+ SET_LOCKTAG_RELATION(partedlocktag, partedrelid.dbId, partedrelid.relId);
+ heap_close(rel, NoLock);
+
+ LockRelationIdForSession(&partedrelid, ShareUpdateExclusiveLock);
+ LockRelationIdForSession(&attachrelid, AccessExclusiveLock);
+
+ snapshot = GetTransactionSnapshot();
+ limitXmin = snapshot->xmin;
+
+ /* Now begin a new transaction */
+ PopActiveSnapshot();
+ CommitTransactionCommand();
+ StartTransactionCommand();
+
+ /*
+ * Technically we're finished with 'rel' here, but we must re-open it
+ * again as the calling alter table code will try to close it. We must
+ * also ensure that we set *relp to point to this new rel.
+ */
+ *relp = rel = heap_open(relid, ShareUpdateExclusiveLock);
+
+ /*
+ * Open and lock the partition relation. The relation's Oid cannot
+ * have changed as we've been holding a session-level lock while the
+	 * transaction was committed and the new one begun.
+ */
+ attachrel = heap_open(attachoid, AccessExclusiveLock);
+
+ /*
+	 * Now we must wait until there are no transactions left which could see
+ * the old list of partitions. Some of these transactions may be
+ * REPEATABLE READ or above in isolation level, so we cannot just
+ * add a new partition during their transaction.
+ */
+ old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
+ PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
+ &n_old_snapshots);
+
+ for (i = 0; i < n_old_snapshots; i++)
+ {
+ if (!VirtualTransactionIdIsValid(old_snapshots[i]))
+ continue; /* found uninteresting in previous cycle */
+
+ if (i > 0)
+ {
+ /* see if anything's changed ... */
+ VirtualTransactionId *newer_snapshots;
+ int n_newer_snapshots;
+ int j;
+ int k;
+
+ newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
+ true, false,
+ PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
+ &n_newer_snapshots);
+ for (j = i; j < n_old_snapshots; j++)
+ {
+ if (!VirtualTransactionIdIsValid(old_snapshots[j]))
+ continue; /* found uninteresting in previous cycle */
+ for (k = 0; k < n_newer_snapshots; k++)
+ {
+ if (VirtualTransactionIdEquals(old_snapshots[j],
+ newer_snapshots[k]))
+ break;
+ }
+ if (k >= n_newer_snapshots) /* not there anymore */
+ SetInvalidVirtualTransactionId(old_snapshots[j]);
+ }
+ pfree(newer_snapshots);
+ }
+
+ if (VirtualTransactionIdIsValid(old_snapshots[i]))
+ VirtualXactLock(old_snapshots[i], true);
+ }
+
+ partRelation = heap_open(PartitionRelationId, RowExclusiveLock);
+
+ tuple = SearchSysCacheCopy1(PARTSRELID,
+ ObjectIdGetDatum(attachoid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", attachoid);
+
+ ((Form_pg_partition) GETSTRUCT(tuple))->partvalid = true;
+
+ CatalogTupleUpdate(partRelation, &tuple->t_self, tuple);
+
+ heap_close(partRelation, RowExclusiveLock);
+
+ /*
+ * Invalidate relcache entries for the partitioned table so that new
+	 * queries pick up the new partition.
+ */
+ CacheInvalidateRelcacheByRelid(relid);
+
+ /*
+	 * Last thing to do is release the session-level locks taken above.
+ */
+ UnlockRelationIdForSession(&partedrelid, ShareUpdateExclusiveLock);
+ UnlockRelationIdForSession(&attachrelid, AccessExclusiveLock);
+ }
+
/* keep our lock until commit */
heap_close(attachrel, NoLock);
@@ -14770,12 +14945,12 @@ CloneRowTriggersToPartition(Relation parent, Relation partition)
}
/*
- * ALTER TABLE DETACH PARTITION
+ * ALTER TABLE DETACH PARTITION [CONCURRENTLY]
*
* Return the address of the relation that is no longer a partition of rel.
*/
static ObjectAddress
-ATExecDetachPartition(Relation rel, RangeVar *name)
+ATExecDetachPartition(Relation rel, PartitionCmd *cmd)
{
Relation partRel,
pgclass,
@@ -14791,6 +14966,8 @@ ATExecDetachPartition(Relation rel, RangeVar *name)
List *indexes;
ListCell *cell;
+ if (cmd->concurrently)
+ elog(NOTICE, "Concurrently");
/*
* We must lock the default partition, because detaching this partition
@@ -14801,7 +14978,7 @@ ATExecDetachPartition(Relation rel, RangeVar *name)
if (OidIsValid(defaultPartOid))
LockRelationOid(defaultPartOid, AccessExclusiveLock);
- partRel = heap_openrv(name, AccessShareLock);
+ partRel = heap_openrv(cmd->name, AccessShareLock);
/* Update pg_class tuple */
pgclass = heap_open(RelationRelationId, RowExclusiveLock);
@@ -15017,7 +15194,7 @@ RangeVarCallbackForAttachIndex(const RangeVar *rv, Oid relOid, Oid oldRelOid,
* ALTER INDEX i1 ATTACH PARTITION i2
*/
static ObjectAddress
-ATExecAttachPartitionIdx(List **wqueue, Relation parentIdx, RangeVar *name)
+ATExecAttachPartitionIdx(List **wqueue, Relation parentIdx, PartitionCmd *cmd)
{
Relation partIdx;
Relation partTbl;
@@ -15027,6 +15204,12 @@ ATExecAttachPartitionIdx(List **wqueue, Relation parentIdx, RangeVar *name)
Oid currParent;
struct AttachIndexCallbackState state;
+ /* ATTACH PARTITION CONCURRENTLY is only supported on tables */
+ if (cmd->concurrently)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot attach an index to a partitioned index concurrently")));
+
/*
* We need to obtain lock on the index 'name' to modify it, but we also
* need to read its owning table's tuple descriptor -- so we need to lock
@@ -15038,14 +15221,14 @@ ATExecAttachPartitionIdx(List **wqueue, Relation parentIdx, RangeVar *name)
state.parentTblOid = parentIdx->rd_index->indrelid;
state.lockedParentTbl = false;
partIdxId =
- RangeVarGetRelidExtended(name, AccessExclusiveLock, 0,
+ RangeVarGetRelidExtended(cmd->name, AccessExclusiveLock, 0,
RangeVarCallbackForAttachIndex,
(void *) &state);
/* Not there? */
if (!OidIsValid(partIdxId))
ereport(ERROR,
(errcode(ERRCODE_UNDEFINED_OBJECT),
- errmsg("index \"%s\" does not exist", name->relname)));
+ errmsg("index \"%s\" does not exist", cmd->name->relname)));
/* no deadlock risk: RangeVarGetRelidExtended already acquired the lock */
partIdx = relation_open(partIdxId, AccessExclusiveLock);
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 7c8220cf65..376d7d0d24 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4575,6 +4575,7 @@ _copyPartitionCmd(const PartitionCmd *from)
COPY_NODE_FIELD(name);
COPY_NODE_FIELD(bound);
+ COPY_SCALAR_FIELD(concurrently);
return newnode;
}
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 378f2facb8..5fc47fefdc 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -2885,6 +2885,7 @@ _equalPartitionCmd(const PartitionCmd *a, const PartitionCmd *b)
{
COMPARE_NODE_FIELD(name);
COMPARE_NODE_FIELD(bound);
+ COMPARE_SCALAR_FIELD(concurrently);
return true;
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index fd06da98b9..8b800151d7 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6928,6 +6928,10 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
int nappinfos;
List *child_scanjoin_targets = NIL;
+ /* Skip invalid partitions */
+ if (!child_rel)
+ continue;
+
/* Translate scan/join targets for this child. */
appinfos = find_appinfos_by_relids(root, child_rel->relids,
&nappinfos);
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 3896617760..a8ae993320 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -107,12 +107,13 @@ static void expand_partitioned_rtentry_recurse(PlannerInfo *root,
RangeTblEntry *parentrte,
Index parentRTindex, Relation parentrel,
PlanRowMark *top_parentrc, LOCKMODE lockmode,
- List **appinfos);
+ int partidx, List **appinfos);
static void expand_single_inheritance_child(PlannerInfo *root,
RangeTblEntry *parentrte,
Index parentRTindex, Relation parentrel,
PlanRowMark *top_parentrc, Relation childrel,
- List **appinfos, RangeTblEntry **childrte_p,
+ int partidx, List **appinfos,
+ RangeTblEntry **childrte_p,
Index *childRTindex_p);
static void make_inh_translation_list(Relation oldrelation,
Relation newrelation,
@@ -1642,7 +1643,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
}
expand_single_inheritance_child(root, rte, rti, oldrelation, oldrc,
- newrelation,
+ newrelation, -1,
&appinfos, &childrte,
&childRTindex);
@@ -1725,7 +1726,7 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
* partition key columns of all the partitioned tables.
*/
expand_partitioned_rtentry_recurse(root, rte, rti, partrel, partrc,
- lockmode, &root->append_rel_list);
+ lockmode, -1, &root->append_rel_list);
heap_close(partrel, NoLock);
}
@@ -1741,7 +1742,7 @@ static void
expand_partitioned_rtentry_recurse(PlannerInfo *root, RangeTblEntry *parentrte,
Index parentRTindex, Relation parentrel,
PlanRowMark *top_parentrc, LOCKMODE lockmode,
- List **appinfos)
+ int partidx, List **appinfos)
{
int i;
RangeTblEntry *childrte;
@@ -1766,14 +1767,14 @@ expand_partitioned_rtentry_recurse(PlannerInfo *root, RangeTblEntry *parentrte,
/* First expand the partitioned table itself. */
expand_single_inheritance_child(root, parentrte, parentRTindex, parentrel,
- top_parentrc, parentrel,
+ top_parentrc, parentrel, partidx,
appinfos, &childrte, &childRTindex);
/*
- * If the partitioned table has no partitions, treat this as the
+ * If the partitioned table has no valid partitions, treat this as the
* non-inheritance case.
*/
- if (partdesc->nparts == 0)
+ if (partdesc->nvalidparts == 0)
{
parentrte->inh = false;
return;
@@ -1781,20 +1782,24 @@ expand_partitioned_rtentry_recurse(PlannerInfo *root, RangeTblEntry *parentrte,
for (i = 0; i < partdesc->nparts; i++)
{
- Oid childOID = partdesc->oids[i];
Relation childrel;
- childrel = heap_open(childOID, lockmode);
+ /* Skip invalid partitions */
+ if (!partdesc->is_valid[i])
+ continue;
+
+ childrel = heap_open(partdesc->oids[i], lockmode);
expand_single_inheritance_child(root, parentrte, parentRTindex,
parentrel, top_parentrc, childrel,
- appinfos, &childrte, &childRTindex);
+ i, appinfos, &childrte,
+ &childRTindex);
/* If this child is itself partitioned, recurse */
if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
expand_partitioned_rtentry_recurse(root, childrte, childRTindex,
childrel, top_parentrc, lockmode,
- appinfos);
+ i, appinfos);
/* Close child relation, but keep locks */
heap_close(childrel, NoLock);
@@ -1826,7 +1831,8 @@ static void
expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
Index parentRTindex, Relation parentrel,
PlanRowMark *top_parentrc, Relation childrel,
- List **appinfos, RangeTblEntry **childrte_p,
+ int partidx, List **appinfos,
+ RangeTblEntry **childrte_p,
Index *childRTindex_p)
{
Query *parse = root->parse;
@@ -1878,6 +1884,7 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
appinfo->child_relid = childRTindex;
appinfo->parent_reltype = parentrel->rd_rel->reltype;
appinfo->child_reltype = childrel->rd_rel->reltype;
+ appinfo->partidx = partidx;
make_inh_translation_list(parentrel, childrel, childRTindex,
&appinfo->translated_vars);
appinfo->parent_reloid = parentOID;
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 8369e3ad62..b95b4fe9d6 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -1911,6 +1911,9 @@ set_relation_partition_info(PlannerInfo *root, RelOptInfo *rel,
Assert(partdesc != NULL && rel->part_scheme != NULL);
rel->boundinfo = partition_bounds_copy(partdesc->boundinfo, partkey);
rel->nparts = partdesc->nparts;
+ rel->nvalidparts = partdesc->nvalidparts;
+ rel->part_valid = (bool *) palloc(rel->nparts * sizeof(bool));
+ memcpy(rel->part_valid, partdesc->is_valid, rel->nparts * sizeof(bool));
set_baserel_partition_key_exprs(relation, rel);
rel->partition_qual = RelationGetPartitionQual(relation);
}
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index c69740eda6..98a54b40c5 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -190,9 +190,11 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptInfo *parent)
rel->has_eclass_joins = false;
rel->part_scheme = NULL;
rel->nparts = 0;
+ rel->nvalidparts = 0;
rel->boundinfo = NULL;
rel->partition_qual = NIL;
rel->part_rels = NULL;
+ rel->part_valid = NULL;
rel->partexprs = NULL;
rel->nullable_partexprs = NULL;
rel->partitioned_child_rels = NIL;
@@ -269,21 +271,22 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptInfo *parent)
{
ListCell *l;
int nparts = rel->nparts;
- int cnt_parts = 0;
if (nparts > 0)
rel->part_rels = (RelOptInfo **)
- palloc(sizeof(RelOptInfo *) * nparts);
+ palloc0(sizeof(RelOptInfo *) * nparts);
foreach(l, root->append_rel_list)
{
AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(l);
RelOptInfo *childrel;
+ int partidx;
/* append_rel_list contains all append rels; ignore others */
if (appinfo->parent_relid != relid)
continue;
+ partidx = appinfo->partidx;
childrel = build_simple_rel(root, appinfo->child_relid,
rel);
@@ -291,18 +294,10 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptInfo *parent)
if (!rel->part_scheme)
continue;
- /*
- * The order of partition OIDs in append_rel_list is the same as
- * the order in the PartitionDesc, so the order of part_rels will
- * also match the PartitionDesc. See expand_partitioned_rtentry.
- */
- Assert(cnt_parts < nparts);
- rel->part_rels[cnt_parts] = childrel;
- cnt_parts++;
+ /* Record the RelOptInfo of this partition */
+ rel->part_rels[partidx] = childrel;
}
- /* We should have seen all the child partitions. */
- Assert(cnt_parts == nparts);
}
return rel;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 87f5e95827..b06f6f1216 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -2012,28 +2012,30 @@ alter_table_cmds:
;
partition_cmd:
- /* ALTER TABLE <name> ATTACH PARTITION <table_name> FOR VALUES */
- ATTACH PARTITION qualified_name PartitionBoundSpec
+ /* ALTER TABLE <name> ATTACH PARTITION [CONCURRENTLY] <table_name> FOR VALUES */
+ ATTACH PARTITION opt_concurrently qualified_name PartitionBoundSpec
{
AlterTableCmd *n = makeNode(AlterTableCmd);
PartitionCmd *cmd = makeNode(PartitionCmd);
n->subtype = AT_AttachPartition;
- cmd->name = $3;
- cmd->bound = $4;
+ cmd->name = $4;
+ cmd->bound = $5;
+ cmd->concurrently = $3;
n->def = (Node *) cmd;
$$ = (Node *) n;
}
- /* ALTER TABLE <name> DETACH PARTITION <partition_name> */
- | DETACH PARTITION qualified_name
+ /* ALTER TABLE <name> DETACH PARTITION [CONCURRENTLY] <partition_name> */
+ | DETACH PARTITION opt_concurrently qualified_name
{
AlterTableCmd *n = makeNode(AlterTableCmd);
PartitionCmd *cmd = makeNode(PartitionCmd);
n->subtype = AT_DetachPartition;
- cmd->name = $3;
+ cmd->name = $4;
cmd->bound = NULL;
+ cmd->concurrently = $3;
n->def = (Node *) cmd;
$$ = (Node *) n;
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index 752810d0e4..32e70ca580 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -602,7 +602,11 @@ prune_append_rel_partitions(RelOptInfo *rel)
i = -1;
result = NULL;
while ((i = bms_next_member(partindexes, i)) >= 0)
- result = bms_add_member(result, rel->part_rels[i]->relid);
+ {
+ /* Skip invalid partitions */
+ if (rel->part_rels[i])
+ result = bms_add_member(result, rel->part_rels[i]->relid);
+ }
return result;
}
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 51a21c4793..dcb3de9959 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -264,10 +264,13 @@ RelationBuildPartitionDesc(Relation rel)
{
List *partoids;
Oid *oids = NULL;
+ bool *isvalid = NULL;
List *boundspecs = NIL;
+ List *isvalidlist = NIL;
ListCell *cell;
int i,
- nparts;
+ nparts,
+ nvalidparts;
PartitionKey key = RelationGetPartitionKey(rel);
PartitionDesc result;
MemoryContext oldcxt;
@@ -304,7 +307,9 @@ RelationBuildPartitionDesc(Relation rel)
partoids = NIL;
while ((partTuple = systable_getnext(scan)) != NULL)
{
- Oid partrelid = ((Form_pg_partition) GETSTRUCT(partTuple))->partrelid;
+ Form_pg_partition partform = (Form_pg_partition) GETSTRUCT(partTuple);
+ Oid partrelid = partform->partrelid;
+ bool valid = partform->partvalid;
HeapTuple tuple;
Datum datum;
bool isnull;
@@ -351,6 +356,7 @@ RelationBuildPartitionDesc(Relation rel)
boundspecs = lappend(boundspecs, boundspec);
partoids = lappend_oid(partoids, partrelid);
+ isvalidlist = lappend_int(isvalidlist, valid); /* XXX int List to store bools? */
ReleaseSysCache(tuple);
}
@@ -359,6 +365,7 @@ RelationBuildPartitionDesc(Relation rel)
heap_close(pgpart, AccessShareLock);
nparts = list_length(partoids);
+ nvalidparts = 0;
if (nparts > 0)
{
@@ -367,6 +374,16 @@ RelationBuildPartitionDesc(Relation rel)
foreach(cell, partoids)
oids[i++] = lfirst_oid(cell);
+ isvalid = (bool *) palloc(nparts * sizeof(bool));
+ i = 0;
+ foreach(cell, isvalidlist)
+ {
+ isvalid[i] = (bool) lfirst_int(cell);
+ if (isvalid[i])
+ nvalidparts++;
+ i++;
+ }
+
/* Convert from node to the internal representation */
if (key->strategy == PARTITION_STRATEGY_HASH)
{
@@ -604,6 +621,7 @@ RelationBuildPartitionDesc(Relation rel)
result = (PartitionDescData *) palloc0(sizeof(PartitionDescData));
result->nparts = nparts;
+ result->nvalidparts = nvalidparts;
if (nparts > 0)
{
PartitionBoundInfo boundinfo;
@@ -612,6 +630,7 @@ RelationBuildPartitionDesc(Relation rel)
result->oids = (Oid *) palloc0(nparts * sizeof(Oid));
result->is_leaf = (bool *) palloc(nparts * sizeof(bool));
+ result->is_valid = (bool *) palloc(nparts * sizeof(bool));
boundinfo = (PartitionBoundInfoData *)
palloc0(sizeof(PartitionBoundInfoData));
@@ -807,6 +826,7 @@ RelationBuildPartitionDesc(Relation rel)
/* Record if the partition is a leaf partition */
result->is_leaf[index] =
(get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
+ result->is_valid[index] = isvalid[i];
}
pfree(mapping);
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 6125421d39..feca620cff 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1024,12 +1024,16 @@ equalPartitionDescs(PartitionKey key, PartitionDesc partdesc1,
/*
* Same oids? If the partitioning structure did not change, that is,
* no partitions were added or removed to the relation, the oids array
- * should still match element-by-element.
+ * should still match element-by-element. The is_valid flag must also
+ * match.
*/
for (i = 0; i < partdesc1->nparts; i++)
{
if (partdesc1->oids[i] != partdesc2->oids[i])
return false;
+
+ if (partdesc1->is_valid[i] != partdesc2->is_valid[i])
+ return false;
}
/*
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 7fde1114a0..518618ee92 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -3008,14 +3008,15 @@ describeOneTableDetails(const char *schemaname,
printfPQExpBuffer(&buf,
"SELECT c.oid::pg_catalog.regclass,"
" pg_catalog.pg_get_expr(p.partbound, p.partrelid),"
- " c.relkind"
+ " c.relkind,"
+ " p.partvalid"
" FROM pg_catalog.pg_class c, pg_catalog.pg_partition p"
" WHERE c.oid=p.partrelid AND p.parentrelid = '%s'"
" ORDER BY pg_catalog.pg_get_expr(p.partbound, p.partrelid) = 'DEFAULT',"
" c.oid::pg_catalog.regclass::pg_catalog.text;", oid);
else
printfPQExpBuffer(&buf,
- "SELECT c.oid::pg_catalog.regclass,NULL,c.relkind"
+ "SELECT c.oid::pg_catalog.regclass,NULL,c.relkind,true"
" FROM pg_catalog.pg_class c, pg_catalog.pg_inherits i"
" WHERE c.oid=i.inhrelid AND i.inhparent = '%s'"
" ORDER BY c.oid::pg_catalog.regclass::pg_catalog.text;", oid);
@@ -3024,7 +3025,7 @@ describeOneTableDetails(const char *schemaname,
printfPQExpBuffer(&buf,
"SELECT c.oid::pg_catalog.regclass,"
" pg_catalog.pg_get_expr(c.relpartbound, c.oid),"
- " c.relkind"
+ " c.relkind,true"
" FROM pg_catalog.pg_class c, pg_catalog.pg_inherits i"
" WHERE c.oid=i.inhrelid AND i.inhparent = '%s'"
" ORDER BY pg_catalog.pg_get_expr(c.relpartbound, c.oid) = 'DEFAULT',"
@@ -3092,20 +3093,26 @@ describeOneTableDetails(const char *schemaname,
else
{
char *partitioned_note;
+ char *validity;
if (*PQgetvalue(result, i, 2) == RELKIND_PARTITIONED_TABLE)
partitioned_note = ", PARTITIONED";
else
partitioned_note = "";
+ if (strcmp(PQgetvalue(result, i, 3), "t") == 0)
+ validity = "";
+ else
+ validity = " INVALID";
+
if (i == 0)
- printfPQExpBuffer(&buf, "%s: %s %s%s",
+ printfPQExpBuffer(&buf, "%s: %s %s%s%s",
ct, PQgetvalue(result, i, 0), PQgetvalue(result, i, 1),
- partitioned_note);
+ partitioned_note, validity);
else
- printfPQExpBuffer(&buf, "%*s %s %s%s",
+ printfPQExpBuffer(&buf, "%*s %s %s%s%s",
ctw, "", PQgetvalue(result, i, 0), PQgetvalue(result, i, 1),
- partitioned_note);
+ partitioned_note, validity);
}
if (i < tuples - 1)
appendPQExpBufferChar(&buf, ',');
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 8cbbd227f4..e4448bbc56 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -27,10 +27,17 @@
typedef struct PartitionDescData
{
int nparts; /* Number of partitions */
+ int nvalidparts; /* Number of partitions which are valid */
Oid *oids; /* Array of length 'nparts' containing
* partition OIDs in order of the their bounds */
bool *is_leaf; /* Array of 'nparts' elements storing whether
* a partition is a leaf partition or not */
+ bool *is_valid; /* Array of 'nparts' elements storing whether
+ * a partition is ready for use by queries.
+ * When not valid a partition is being
+ * concurrently attached, or a concurrent
+ * attach failed. XXX is it worth combining
+ * these two arrays into flags? */
PartitionBoundInfo boundinfo; /* collection of partition bounds */
} PartitionDescData;
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 61b31fb2bb..1b85beb52b 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -857,6 +857,7 @@ typedef struct PartitionCmd
NodeTag type;
RangeVar *name; /* name of partition to attach/detach */
PartitionBoundSpec *bound; /* FOR VALUES, if attaching */
+ bool concurrently; /* true if CONCURRENTLY keyword was used */
} PartitionCmd;
/****************************************************************************
@@ -1813,6 +1814,7 @@ typedef struct AlterTableCmd /* one subcommand of an ALTER TABLE */
* constraint, or parent table */
DropBehavior behavior; /* RESTRICT or CASCADE for DROP cases */
bool missing_ok; /* skip error if missing? */
+ bool concurrently; /* reduced lock level */
} AlterTableCmd;
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 41caf873fb..f975a33d5b 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -693,10 +693,14 @@ typedef struct RelOptInfo
/* used for partitioned relations */
PartitionScheme part_scheme; /* Partitioning scheme. */
int nparts; /* number of partitions */
+ int nvalidparts; /* number of valid partitions */
struct PartitionBoundInfoData *boundinfo; /* Partition bounds */
List *partition_qual; /* partition constraint */
struct RelOptInfo **part_rels; /* Array of RelOptInfos of partitions,
* stored in the same order of bounds */
+ bool *part_valid; /* Array of 'nparts' elements set to true if
+ * the given partition's ATTACH is complete
+ * and is not concurrently being DETACHed */
List **partexprs; /* Non-nullable partition key expressions. */
List **nullable_partexprs; /* Nullable partition key expressions. */
List *partitioned_child_rels; /* List of RT indexes. */
@@ -2138,6 +2142,13 @@ typedef struct AppendRelInfo
Oid parent_reltype; /* OID of parent's composite type */
Oid child_reltype; /* OID of child's composite type */
+ /*
+ * Index into PartitionDesc arrays of this partition, or -1 if the
+ * AppendRelInfo belongs to an inheritance child table or if it
+ * belongs to the top-level partitioned table.
+ */
+ int partidx;
+
/*
* The N'th element of this list is a Var or expression representing the
* child column corresponding to the N'th column of the parent. This is
--
2.16.2.windows.1
On 3 August 2018 at 01:25, David Rowley <david.rowley@2ndquadrant.com> wrote:
1. Do all the normal partition attach partition validation.
2. Insert a record into pg_partition with partisvalid=false
3. Obtain a session-level ShareUpdateExclusiveLock on the partitioned table.
4. Obtain a session-level AccessExclusiveLock on the partition being attached.
5. Commit.
6. Start a new transaction.
7. Wait for snapshots older than our own to be released.
8. Mark the partition as valid
9. Invalidate relcache for the partitioned table.
10. release session-level locks.
So I was thinking about this again and realised this logic is broken.
All it takes is a snapshot that starts after the ATTACH PARTITION
started and before it completed. This snapshot will have the new
partition attached while it's possibly still open, which could lead to
non-repeatable reads in a repeatable read transaction. The window for
this to occur is possibly quite large given that the ATTACH
CONCURRENTLY can wait a long time for older snapshots to finish.
Here's my updated thinking for an implementation which seems to get
around the above problem:
ATTACH PARTITION CONCURRENTLY:
1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather
than an AccessExclusiveLock.
2. Do all the normal partition attach partition validation.
3. Insert pg_partition record with partvalid = true.
4. Invalidate relcache entry for the partitioned table
5. Any loops over a partitioned table's PartitionDesc must check
PartitionIsValid(). This will return true if the current snapshot
should see the partition or not. The partition is valid if partisvalid
= true and the xmin precedes or is equal to the current snapshot.
#define PartitionIsValid(pd, i) (((pd)->is_valid[(i)] \
&& TransactionIdPrecedesOrEquals((pd)->xmin[(i)], GetCurrentTransactionId())) \
|| (!(pd)->is_valid[(i)] \
&& TransactionIdPrecedesOrEquals(GetCurrentTransactionId(), (pd)->xmin[(i)])))
DETACH PARTITION CONCURRENTLY:
1. Obtain ShareUpdateExclusiveLock on partition being detached
(instead of the AccessShareLock that non-concurrent detach uses)
2. Update the pg_partition record, set partvalid = false.
3. Commit
4. New transaction.
5. Wait for transactions which hold a snapshot older than the one held
when updating pg_partition to complete.
6. Delete the pg_partition record.
7. Perform other cleanup, relpartitionparent = 0, pg_depend etc.
DETACH PARTITION CONCURRENTLY failure handling (when it fails after step 3 above):
1. Make vacuum of a partition check for pg_partition.partvalid =
false, if xmin of tuple is old enough, perform a partition cleanup by
doing steps 6+7 above.
A VACUUM FREEZE must run before transaction wraparound, so this means
a partition can never reattach itself when the transaction counter
wraps.
I believe I've got the attach and detach working correctly now and
also isolation tests that appear to prove it works. I've also written
the failed detach cleanup code into vacuum. Unusually, since foreign
tables can also be partitions, this required teaching auto-vacuum to
look at foreign tables, though only in the sense of checking for
failed detached partitions. It also required adding vacuum support
for foreign tables. It feels a little bit weird to modify auto-vacuum
to look at foreign tables, but I really couldn't see another way to do
this.
I'm now considering if this all holds together in the event the
pg_partition tuple of an invalid partition becomes frozen. The problem
would be that PartitionIsValid() could return the wrong value due to
TransactionIdPrecedesOrEquals(GetCurrentTransactionId(),
(pd)->xmin[(i)]). This code is trying to keep the detached partition
visible to older snapshots, but if pd->xmin[i] becomes frozen, then
the partition would become invisible. However, I think this won't be
a problem since a VACUUM FREEZE would only freeze tuples that are also
old enough to have failed detaches cleaned up earlier in the vacuum
process.
Also, we must disallow a DEFAULT partition from being attached to a
partition with a failed DETACH CONCURRENTLY as it wouldn't be very
clear what the default partition's partition qual would be, as this is
built based on the quals of all attached partitions.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
On 2018-08-08 00:40:12 +1200, David Rowley wrote:
1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather
than an AccessExclusiveLock.
2. Do all the normal partition attach partition validation.
3. Insert pg_partition record with partvalid = true.
4. Invalidate relcache entry for the partitioned table
5. Any loops over a partitioned table's PartitionDesc must check
PartitionIsValid(). This will return true if the current snapshot
should see the partition or not. The partition is valid if partisvalid
= true and the xmin precedes or is equal to the current snapshot.
How does this protect against other sessions actively using the relcache
entry? Currently it is *NOT* safe to receive invalidations for
e.g. partitioning contents afaics.
- Andres
On 7 August 2018 at 13:47, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2018-08-08 00:40:12 +1200, David Rowley wrote:
1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather
than an AccessExclusiveLock.
2. Do all the normal partition attach partition validation.
3. Insert pg_partition record with partvalid = true.
4. Invalidate relcache entry for the partitioned table
5. Any loops over a partitioned table's PartitionDesc must check
PartitionIsValid(). This will return true if the current snapshot
should see the partition or not. The partition is valid if partisvalid
= true and the xmin precedes or is equal to the current snapshot.
How does this protect against other sessions actively using the relcache
entry? Currently it is *NOT* safe to receive invalidations for
e.g. partitioning contents afaics.
I think you may be right in the general case, but ISTM possible to
invalidate/refresh just the list of partitions.
If so, that idea would seem to require some new, as-yet not invented mechanism.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 8 August 2018 at 00:47, Andres Freund <andres@anarazel.de> wrote:
On 2018-08-08 00:40:12 +1200, David Rowley wrote:
1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather
than an AccessExclusiveLock.
2. Do all the normal partition attach partition validation.
3. Insert pg_partition record with partvalid = true.
4. Invalidate relcache entry for the partitioned table
5. Any loops over a partitioned table's PartitionDesc must check
PartitionIsValid(). This will return true if the current snapshot
should see the partition or not. The partition is valid if partisvalid
= true and the xmin precedes or is equal to the current snapshot.
How does this protect against other sessions actively using the relcache
entry? Currently it is *NOT* safe to receive invalidations for
e.g. partitioning contents afaics.
I'm not proposing that sessions running older snapshots can't see that
there's a new partition. The code I have uses PartitionIsValid() to
test if the partition should be visible to the snapshot. The
PartitionDesc will always contain details for all partitions stored in
pg_partition whether they're valid to the current snapshot or not. I
did it this way as there's no way to invalidate the relcache based on
a point in transaction, only a point in time.
I'm open to better ideas, of course.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2018-08-08 01:23:51 +1200, David Rowley wrote:
On 8 August 2018 at 00:47, Andres Freund <andres@anarazel.de> wrote:
On 2018-08-08 00:40:12 +1200, David Rowley wrote:
1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather
than an AccessExclusiveLock.
2. Do all the normal partition attach partition validation.
3. Insert pg_partition record with partvalid = true.
4. Invalidate relcache entry for the partitioned table
5. Any loops over a partitioned table's PartitionDesc must check
PartitionIsValid(). This will return true if the current snapshot
should see the partition or not. The partition is valid if partisvalid
= true and the xmin precedes or is equal to the current snapshot.
How does this protect against other sessions actively using the relcache
entry? Currently it is *NOT* safe to receive invalidations for
e.g. partitioning contents afaics.
I'm not proposing that sessions running older snapshots can't see that
there's a new partition. The code I have uses PartitionIsValid() to
test if the partition should be visible to the snapshot. The
PartitionDesc will always contain details for all partitions stored in
pg_partition whether they're valid to the current snapshot or not. I
did it this way as there's no way to invalidate the relcache based on
a point in transaction, only a point in time.
I don't think that solves the problem that an arriving relcache
invalidation would trigger a rebuild of rd_partdesc, while it actually
is referenced by running code.
You'd need to build infrastructure to prevent that.
One approach would be to make sure that everything relying on
rt_partdesc staying the same stores its value in a local variable, and
then *not* free the old version of rt_partdesc (etc) when the refcount >
0, but delay that to the RelationClose() that makes refcount reach
0. That'd be the start of a framework for more such concurrenct
handling.
Regards,
Andres Freund
On 8 August 2018 at 01:29, Andres Freund <andres@anarazel.de> wrote:
On 2018-08-08 01:23:51 +1200, David Rowley wrote:
I'm not proposing that sessions running older snapshots can't see that
there's a new partition. The code I have uses PartitionIsValid() to
test if the partition should be visible to the snapshot. The
PartitionDesc will always contain details for all partitions stored in
pg_partition whether they're valid to the current snapshot or not. I
did it this way as there's no way to invalidate the relcache based on
a point in transaction, only a point in time.I don't think that solves the problem that an arriving relcache
invalidation would trigger a rebuild of rd_partdesc, while it actually
is referenced by running code.You'd need to build infrastructure to prevent that.
One approach would be to make sure that everything relying on
rt_partdesc staying the same stores its value in a local variable, and
then *not* free the old version of rt_partdesc (etc) when the refcount >
0, but delay that to the RelationClose() that makes refcount reach
0. That'd be the start of a framework for more such concurrenct
handling.
I'm not so sure that deferring the freeing of the partdesc until the
refcount reaches 0 is safe. We hold a lock on a partitioned table
between the planner and executor, but the relation is closed in
between and its ref count returns to 0, which means that when the
relation is first opened in the executor the updated PartitionDesc is
obtained. A non-concurrent attach would have been blocked in this
case due to the lock being held by the planner. Instead of using
refcount == 0, perhaps we can release the original partdesc only when
we hold no locks on the relation ourselves.
It's late here now, so I'll look at that tomorrow.
I've attached what I was playing around with. I think I'll also need
to change RelationGetPartitionDesc() to have it return the original
partdesc, if it's non-NULL.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
dont_destroy_original_partdesc_on_rel_inval.patch (application/octet-stream)
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index feca620cff..a25ead01d4 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1202,6 +1202,9 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
relation->rd_pdcxt = NULL;
}
+ relation->rd_partdesc_orig = NULL;
+ relation->rd_pdcxt_orig = NULL;
+
/*
* if it's an index, initialize index-related information
*/
@@ -2012,6 +2015,20 @@ RelationClose(Relation relation)
/* Note: no locking manipulations needed */
RelationDecrementReferenceCount(relation);
+ /*
+ * If the partdesc has been changed while the relation had a non-zero
+ * ref count we'll have stored the original partdesc. When the ref
+ * count reaches 0 we must get rid of the original partdesc and destroy
+ * the memory context it was stored in.
+ */
+ if (relation->rd_partdesc_orig &&
+ RelationHasReferenceCountZero(relation))
+ {
+ MemoryContextDelete(relation->rd_pdcxt_orig);
+ relation->rd_partdesc_orig = NULL;
+ relation->rd_pdcxt_orig = NULL;
+ }
+
#ifdef RELCACHE_FORCE_RELEASE
if (RelationHasReferenceCountZero(relation) &&
relation->rd_createSubid == InvalidSubTransactionId &&
@@ -2288,7 +2305,7 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
MemoryContextDelete(relation->rd_rsdesc->rscxt);
if (relation->rd_partkeycxt)
MemoryContextDelete(relation->rd_partkeycxt);
- if (relation->rd_pdcxt)
+ if (relation->rd_pdcxt && !relation->rd_pdcxt_orig)
MemoryContextDelete(relation->rd_pdcxt);
if (relation->rd_partcheck)
pfree(relation->rd_partcheck);
@@ -2546,6 +2563,27 @@ RelationClearRelation(Relation relation, bool rebuild)
SWAPFIELD(MemoryContext, rd_pdcxt);
}
+ /*
+ * When the partdesc changes while the refcount is > 0 then we must
+ * keep the original partdesc intact as it might currently be getting
+ * used by some code. We store the original partdesc in
+ * rd_partdesc_orig. If it changes multiple times we don't need to
+ * keep the intermediate ones, just the original. This gets cleared
+ * up and set to NULL again when the ref count reaches 0.
+ */
+ else if (relation->rd_partdesc != NULL && newrel->rd_partdesc != NULL &&
+ relation->rd_partdesc_orig == NULL)
+ {
+ relation->rd_partdesc_orig = newrel->rd_partdesc;
+ relation->rd_pdcxt_orig = newrel->rd_pdcxt;
+
+ /*
+ * Set this so that RelationDestroyRelation does not destroy the
+ * memory context of the original partdesc.
+ */
+ newrel->rd_pdcxt_orig = newrel->rd_pdcxt;
+ }
+
#undef SWAPFIELD
/* And now we can throw away the temporary entry */
@@ -5652,6 +5690,8 @@ load_relcache_init_file(bool shared)
rel->rd_partkey = NULL;
rel->rd_pdcxt = NULL;
rel->rd_partdesc = NULL;
+ rel->rd_pdcxt_orig = NULL;
+ rel->rd_partdesc_orig = NULL;
rel->rd_partcheck = NIL;
rel->rd_indexprs = NIL;
rel->rd_indpred = NIL;
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index eb1858aa92..7724d8da15 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -99,6 +99,12 @@ typedef struct RelationData
struct PartitionKeyData *rd_partkey; /* partition key, or NULL */
MemoryContext rd_pdcxt; /* private context for partdesc */
struct PartitionDescData *rd_partdesc; /* partitions, or NULL */
+ MemoryContext rd_pdcxt_orig; /* private context for rd_partdesc_orig */
+ struct PartitionDescData *rd_partdesc_orig; /* original partitions, or
+ * NULL if not partitioned or
+ * the rd_partdesc has not
+ * changed while the ref count
+ * was non-zero */
List *rd_partcheck; /* partition CHECK quals */
/* data managed by RelationGetIndexList: */
On 07/08/2018 15:29, Andres Freund wrote:
I don't think that solves the problem that an arriving relcache
invalidation would trigger a rebuild of rd_partdesc, while it actually
is referenced by running code.
The problem is more generally that a relcache invalidation changes all
pointers that might be in use. So it's currently not safe to trigger a
relcache invalidation (on tables) without some kind of exclusive lock.
One possible solution to this is outlined here:
</messages/by-id/CA+TgmobtmFT5g-0dA=vEFFtogjRAuDHcYPw+qEdou5dZPnF=pg@mail.gmail.com>
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 2018-08-09 20:57:35 +0200, Peter Eisentraut wrote:
On 07/08/2018 15:29, Andres Freund wrote:
I don't think that solves the problem that an arriving relcache
invalidation would trigger a rebuild of rd_partdesc, while it actually
is referenced by running code.
The problem is more generally that a relcache invalidation changes all
pointers that might be in use.
I don't think that's quite right. We already better be OK with
superfluous invals that do not change anything, because there's already
sources of those (just think of vacuum, analyze, relation extension,
whatnot).
So it's currently not safe to trigger a relcache invalidation (on
tables) without some kind of exclusive lock.
I don't think that's true in as general a sense as you're stating it.
It's not OK to send relcache invalidations for things that people rely
on, and that cannot be updated in-place. Because of the dangling pointer
issue etc.
The fact that currently it is not safe to *change* partition-related
stuff without an AEL, and how to make it safe, is precisely what I was
talking about in the thread. It won't be a general solution, but the
infrastructure I'm talking about should get us closer.
One possible solution to this is outlined here:
</messages/by-id/CA+TgmobtmFT5g-0dA=vEFFtogjRAuDHcYPw+qEdou5dZPnF=pg@mail.gmail.com>
I don't see anything in here that addresses the issue structurally?
Greetings,
Andres Freund
On 8 August 2018 at 01:29, Andres Freund <andres@anarazel.de> wrote:
One approach would be to make sure that everything relying on
rd_partdesc staying the same stores its value in a local variable, and
then *not* free the old version of rd_partdesc (etc) when the refcount >
0, but delay that to the RelationClose() that makes refcount reach
0. That'd be the start of a framework for more such concurrent
handling.
This is not a fully baked idea, but I'm wondering if a better way to
do this, instead of having this PartitionIsValid macro to determine if
the partition should be visible to the current transaction ID, we
could, when we invalidate a relcache entry, send along the transaction
ID that it's invalid from. Other backends when they process the
invalidation message they could wipe out the cache entry only if their
xid is >= the invalidation's "xmax" value. Otherwise, just tag the
xmax onto the cache somewhere and always check it before using the
cache (perhaps make it part of the RelationIsValid macro). This would
also require that we move away from SnapshotAny type catalogue scans
in favour of MVCC scans so that backends populating their relcache
build it based on their current xid. Unless I'm mistaken, it should
not make any difference for all DDL that takes an AEL on the relation,
since there can be no older transactions running when the catalogue is
modified, but for DDL that's not taking an AEL, we could effectively
have an MVCC relcache.
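Roughly, the mechanics I have in mind would look something like this
toy sketch (standalone C, all names invented; "inval_xmax" is my ad-hoc
term for the xid the invalidation becomes effective from):

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t Xid;

/* A toy cache entry that can carry a pending invalidation cutoff. */
typedef struct CacheEntry
{
    bool valid;
    Xid  inval_xmax;            /* 0 = no pending invalidation */
} CacheEntry;

/* On receipt of an invalidation tagged with the xid it applies from:
 * wipe the entry only if our xid is >= that cutoff; otherwise keep
 * using the entry but remember the cutoff. */
void
process_inval(CacheEntry *ent, Xid my_xid, Xid inval_xmax)
{
    if (my_xid >= inval_xmax)
        ent->valid = false;            /* rebuild on next access */
    else
        ent->inval_xmax = inval_xmax;  /* keep, but remember the cutoff */
}

/* What a RelationIsValid-style check would then have to consult. */
bool
entry_usable(const CacheEntry *ent, Xid my_xid)
{
    if (!ent->valid)
        return false;
    if (ent->inval_xmax != 0 && my_xid >= ent->inval_xmax)
        return false;
    return true;
}
```

(Deliberately ignoring here that xids don't commit in order, which, as
discussed below, is a real problem with comparing raw xids like this.)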
It would need careful thought about how it might affect CREATE INDEX
CONCURRENTLY and all the other DDL that can be performed without an
AEL.
I'm unsure how this would work for the catcache as I've studied that
code in even less detail, but throwing this out there in case there
some major flaw in this idea so that I don't go wasting time looking
into it further.
I think the PartitionIsValid idea was not that great, as it really
complicates run-time partition pruning, which depends critically on the
partition indexes being the same between the planner and the executor.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Aug 12, 2018 at 9:05 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
This is not a fully baked idea, but I'm wondering if a better way to
do this, instead of having this PartitionIsValid macro to determine if
the partition should be visible to the current transaction ID, we
could, when we invalidate a relcache entry, send along the transaction
ID that it's invalid from. Other backends when they process the
invalidation message they could wipe out the cache entry only if their
xid is >= the invalidation's "xmax" value. Otherwise, just tag the
xmax onto the cache somewhere and always check it before using the
cache (perhaps make it part of the RelationIsValid macro).
Transactions don't necessarily commit in XID order, so this might be
an optimization to keep older transactions from having to do
unnecessary rebuilds -- which I actually doubt is a major problem, but
maybe I'm wrong -- but you can't rely solely on this as a way of
deciding which transactions will see the effects of some change. If
transactions 100, 101, and 102 begin in that order, and transaction
101 commits, there's no principled justification for 102 seeing its
effects but 100 not seeing it.
This would
also require that we move away from SnapshotAny type catalogue scans
in favour of MVCC scans so that backends populating their relcache
build it based on their current xid.
I think this is a somewhat confused analysis. We don't use
SnapshotAny for catalog scans, and we never have. We used to use
SnapshotNow, and we now use a current MVCC snapshot. What you're
talking about, I think, is possibly using the transaction snapshot
rather than a current MVCC snapshot for the catalog scans.
I've thought about similar things, but I think there's a pretty deep
can of worms. For instance, if you built a relcache entry using the
transaction snapshot, you might end up building a seemingly-valid
relcache entry for a relation that has been dropped or rewritten.
When you try to access the relation data, you'll attempt to access
a relfilenode that's not there any more. Similarly, if you use an
older snapshot to build a partition descriptor, you might think that
relation OID 12345 is still a partition of that table when in fact
it's been detached - and, maybe, altered in other ways, such as
changing column types.
It seems to me that overall you're not really focusing on the right
set of issues here. I think the very first thing we need to worry
about how we're going to keep the executor from following a bad
pointer and crashing. Any time the partition descriptor changes, the
next relcache rebuild is going to replace rd_partdesc and free the old
one, but the executor may still have the old pointer cached in a
structure or local variable; the next attempt to dereference it will
be looking at freed memory, and kaboom. Right now, we prevent this by
not allowing the partition descriptor to be modified while there are
any queries running against the partition, so while there may be a
rebuild, the old pointer will remain valid (cf. keep_partdesc). I
think that whatever scheme you/we choose here should be tested with a
combination of CLOBBER_CACHE_ALWAYS and multiple concurrent sessions
-- one of them doing DDL on the table while the other runs a long
query.
Once we keep it from blowing up, the second question is what the
semantics are supposed to be. It seems inconceivable to me that the
set of partitions that an in-progress query will scan can really be
changed on the fly. I think we're going to have to rule that if you
add or remove partitions while a query is running, we're going to scan
exactly the set we had planned to scan at the beginning of the query;
anything else would require on-the-fly plan surgery to a degree that
seems unrealistic. That means that when a new partition is attached,
already-running queries aren't going to scan it. If they did, we'd
have big problems, because the transaction snapshot might see rows in
those tables from an earlier time period during which that table
wasn't attached. There's no guarantee that the data at that time
conformed to the partition constraint, so it would be pretty
problematic to let users see it. Conversely, when a partition is
detached, there may still be scans from existing queries hitting it
for a fairly arbitrary length of time afterwards. That may be
surprising from a locking point of view or something, but it's correct
as far as MVCC goes. Any changes made after the DETACH operation
can't be visible to the snapshot being used for the scan.
Now, what we could try to change on the fly is the set of partitions
that are used for tuple routing. For instance, suppose we're
inserting a long stream of COPY data. At some point, we attach a new
partition from another session. If we then encounter a row that
doesn't route to any of the partitions that existed at the time the
query started, we could - instead of immediately failing - go and
reload the set of partitions that are available for tuple routing and
see if the new partition which was concurrently added happens to be
appropriate to the tuple we've got. If so, we could route the tuple
to it. But all of this looks optional. If new partitions aren't
available for insert/update tuple routing until the start of the next
query, that's not a catastrophe. The reverse direction might be more
problematic: if a partition is detached, I'm not sure how sensible it
is to keep routing tuples into it. On the flip side, what would
break, really?
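To illustrate the retry idea for tuple routing, here's a toy
standalone-C sketch (find_partition/route_tuple and the range layout
are invented for illustration; the real routing code is of course far
more involved):

```c
#include <stddef.h>

/* Toy routing info: partitions defined by ascending upper bounds,
 * partition i accepts keys below bounds[i]. */
typedef struct PartSet { int nbounds; int bounds[8]; } PartSet;

int
find_partition(const PartSet *ps, int key)
{
    for (int i = 0; i < ps->nbounds; i++)
        if (key < ps->bounds[i])
            return i;
    return -1;                  /* no partition accepts this key */
}

/* If the tuple routes nowhere, reload the partition set once and
 * retry before failing, so a concurrently attached partition can
 * pick it up.  "reloaded" stands in for rebuilding from the catalogs. */
int
route_tuple(PartSet *ps, int key, const PartSet *reloaded)
{
    int idx = find_partition(ps, key);
    if (idx < 0 && reloaded != NULL)
    {
        *ps = *reloaded;        /* refresh routing info once */
        idx = find_partition(ps, key);
    }
    return idx;                 /* -1 => genuinely no home: error out */
}
```

As the text says, this is strictly optional behavior: failing until the
next query starts would also be defensible.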
Given the foregoing, I don't see why you need something like
PartitionIsValid() at all, or why you need an algorithm similar to
CREATE INDEX CONCURRENTLY. The problem seems mostly different. In
the case of CREATE INDEX CONCURRENTLY, the issue is that any new
tuples that get inserted while the index creation is in progress need
to end up in the index, so you'd better not start building the index
on the existing tuples until everybody who might insert new tuples
knows about the index. I don't see that we have the same kind of
problem in this case. Each partition is basically a separate table
with its own set of indexes; as long as queries don't end up with one
notion of which tables are relevant and a different notion of which
indexes are relevant, we shouldn't end up with any table/index
inconsistencies. And it's not clear to me what other problems we
actually have here. To put it another way, if we've got the notion of
"isvalid" for a partition, what's the difference between a partition
that exists but is not yet valid and one that exists and is valid? I
can't think of anything, and I have a feeling that you may therefore
be inventing a lot of infrastructure that is not necessary.
I'm inclined to think that we could drop the name CONCURRENTLY from
this feature altogether and recast it as work to reduce the lock level
associated with partition attach/detach. As long as we have a
reasonable definition of what the semantics are for already-running
queries, and clear documentation to go with those semantics, that
seems fine. If a particular user finds the concurrent behavior too
strange, they can always perform the DDL in a transaction that uses
LOCK TABLE first, removing the concurrency.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 14 August 2018 at 04:00, Robert Haas <robertmhaas@gmail.com> wrote:
I've thought about similar things, but I think there's a pretty deep
can of worms. For instance, if you built a relcache entry using the
transaction snapshot, you might end up building a seemingly-valid
relcache entry for a relation that has been dropped or rewritten.
When you try to access the relation data, you'll attempt to access
a relfilenode that's not there any more. Similarly, if you use an
older snapshot to build a partition descriptor, you might think that
relation OID 12345 is still a partition of that table when in fact
it's been detached - and, maybe, altered in other ways, such as
changing column types.
hmm, I guess for that to work correctly we'd need some way to allow
older snapshots to see the changes if they've not already taken a lock
on the table. If the lock had already been obtained then the ALTER
TABLE to change the type of the column would get blocked by the
existing lock. That kinda blows holes in applying the change only to
snapshots newer than the ATTACH/DETACH's.
It seems to me that overall you're not really focusing on the right
set of issues here. I think the very first thing we need to worry
about how we're going to keep the executor from following a bad
pointer and crashing. Any time the partition descriptor changes, the
next relcache rebuild is going to replace rd_partdesc and free the old
one, but the executor may still have the old pointer cached in a
structure or local variable; the next attempt to dereference it will
be looking at freed memory, and kaboom. Right now, we prevent this by
not allowing the partition descriptor to be modified while there are
any queries running against the partition, so while there may be a
rebuild, the old pointer will remain valid (cf. keep_partdesc). I
think that whatever scheme you/we choose here should be tested with a
combination of CLOBBER_CACHE_ALWAYS and multiple concurrent sessions
-- one of them doing DDL on the table while the other runs a long
query.
I did focus on that and did write a patch to solve the issue. After
writing that I discovered another problem where if the PartitionDesc
differed between planning and execution then run-time pruning did the
wrong thing (See find_matching_subplans_recurse). The
PartitionPruneInfo is built assuming the PartitionDesc matches between
planning and execution. I moved on from the dangling pointer issue
onto trying to figure out a way to ensure these are the same between
planning and execution.
Once we keep it from blowing up, the second question is what the
semantics are supposed to be. It seems inconceivable to me that the
set of partitions that an in-progress query will scan can really be
changed on the fly. I think we're going to have to rule that if you
add or remove partitions while a query is running, we're going to scan
exactly the set we had planned to scan at the beginning of the query;
anything else would require on-the-fly plan surgery to a degree that
seems unrealistic.
Trying to do that for in-progress queries would be pretty insane. I'm
not planning on doing anything there.
That means that when a new partition is attached,
already-running queries aren't going to scan it. If they did, we'd
have big problems, because the transaction snapshot might see rows in
those tables from an earlier time period during which that table
wasn't attached. There's no guarantee that the data at that time
conformed to the partition constraint, so it would be pretty
problematic to let users see it. Conversely, when a partition is
detached, there may still be scans from existing queries hitting it
for a fairly arbitrary length of time afterwards. That may be
surprising from a locking point of view or something, but it's correct
as far as MVCC goes. Any changes made after the DETACH operation
can't be visible to the snapshot being used for the scan.
Now, what we could try to change on the fly is the set of partitions
that are used for tuple routing. For instance, suppose we're
inserting a long stream of COPY data. At some point, we attach a new
partition from another session. If we then encounter a row that
doesn't route to any of the partitions that existed at the time the
query started, we could - instead of immediately failing - go and
reload the set of partitions that are available for tuple routing and
see if the new partition which was concurrently added happens to be
appropriate to the tuple we've got. If so, we could route the tuple
to it. But all of this looks optional. If new partitions aren't
available for insert/update tuple routing until the start of the next
query, that's not a catastrophe. The reverse direction might be more
problematic: if a partition is detached, I'm not sure how sensible it
is to keep routing tuples into it. On the flip side, what would
break, really?
Unsure about that, I don't really see what it would buy us, so
presumably you're just considering that this might not be a
roadblocking side-effect. However, I think the PartitionDesc needs to
not change between planning and execution due to run-time pruning
requirements, so if that's the case then what you're saying here is
probably not an issue we need to think about.
Given the foregoing, I don't see why you need something like
PartitionIsValid() at all, or why you need an algorithm similar to
CREATE INDEX CONCURRENTLY. The problem seems mostly different. In
the case of CREATE INDEX CONCURRENTLY, the issue is that any new
tuples that get inserted while the index creation is in progress need
to end up in the index, so you'd better not start building the index
on the existing tuples until everybody who might insert new tuples
knows about the index. I don't see that we have the same kind of
problem in this case. Each partition is basically a separate table
with its own set of indexes; as long as queries don't end up with one
notion of which tables are relevant and a different notion of which
indexes are relevant, we shouldn't end up with any table/index
inconsistencies. And it's not clear to me what other problems we
actually have here. To put it another way, if we've got the notion of
"isvalid" for a partition, what's the difference between a partition
that exists but is not yet valid and one that exists and is valid? I
can't think of anything, and I have a feeling that you may therefore
be inventing a lot of infrastructure that is not necessary.
Well, the problem is that you want REPEATABLE READ transactions to be
exactly that. A concurrent attach/detach should not change the output
of a query. I don't know for sure that some isvalid flag is required,
but we do need something to ensure we don't change the results of
queries run inside a repeatable read transaction. I did try to start
moving away from the isvalid flag in favour of having a PartitionDesc
just not change within the same snapshot but you've pointed out a few
problems with what I tried to come up with for that.
I'm inclined to think that we could drop the name CONCURRENTLY from
this feature altogether and recast it as work to reduce the lock level
associated with partition attach/detach. As long as we have a
reasonable definition of what the semantics are for already-running
queries, and clear documentation to go with those semantics, that
seems fine. If a particular user finds the concurrent behavior too
strange, they can always perform the DDL in a transaction that uses
LOCK TABLE first, removing the concurrency.
I did have similar thoughts but that seems like something to think
about once the semantics are determined, not before.
Thanks for your input on this. I clearly don't have all the answers on
this so your input and thoughts are very valuable.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2018-Aug-13, Robert Haas wrote:
I think this is a somewhat confused analysis. We don't use
SnapshotAny for catalog scans, and we never have. We used to use
SnapshotNow, and we now use a current MVCC snapshot. What you're
talking about, I think, is possibly using the transaction snapshot
rather than a current MVCC snapshot for the catalog scans.
I've thought about similar things, but I think there's a pretty deep
can of worms. For instance, if you built a relcache entry using the
transaction snapshot, you might end up building a seemingly-valid
relcache entry for a relation that has been dropped or rewritten.
When you try to access the relation data, you'll attempt to access
a relfilenode that's not there any more. Similarly, if you use an
older snapshot to build a partition descriptor, you might think that
relation OID 12345 is still a partition of that table when in fact
it's been detached - and, maybe, altered in other ways, such as
changing column types.
I wonder if this all stems from a misunderstanding of what I suggested
to David offlist. My suggestion was that the catalog scans would
continue to use the catalog MVCC snapshot, and that the relcache entries
would contain all the partitions that appear to the catalog; but each
partition's entry would carry the Xid of the creating transaction in a
field (say xpart), and that field is compared to the regular transaction
snapshot: if xpart is visible to the transaction snapshot, then the
partition is visible, otherwise not. So you never try to access a
partition that doesn't exist, because those just don't appear at all in
the relcache entry. But if you have an old transaction running with an
old snapshot, and the partitioned table just acquired a new partition,
then whether the partition will be returned as part of the partition
descriptor or not depends on the visibility of its entry.
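To make the xpart filtering concrete, here's a toy standalone-C sketch
(the Snap/Part structs and the single-xmin visibility rule are gross
simplifications of real snapshot semantics, invented purely for
illustration):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;

/* Grossly simplified snapshot: every xid >= xmin is treated as
 * in-progress or future, everything below as committed-and-visible. */
typedef struct Snap { Xid xmin; } Snap;

/* Each partition entry carries the xid of the creating transaction. */
typedef struct Part { Xid xpart; int oid; } Part;

bool
xpart_visible(const Snap *snap, Xid xpart)
{
    return xpart < snap->xmin;
}

/* Build the descriptor this snapshot should see: the relcache keeps
 * every partition the catalogs show, and we filter at use time, so we
 * never hand out a partition that doesn't exist -- old snapshots just
 * see fewer of them. */
int
visible_parts(const Snap *snap, const Part *all, int n, int *out_oids)
{
    int nvis = 0;
    for (int i = 0; i < n; i++)
        if (xpart_visible(snap, all[i].xpart))
            out_oids[nvis++] = all[i].oid;
    return nvis;
}
```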
I think that works fine for ATTACH without any further changes. I'm not
so sure about DETACH, particularly when snapshots persist for a "long
time" (a repeatable-read transaction). ISTM that in the above design,
the partition descriptor would lose the entry for the detached partition
ahead of time, which means queries would silently fail to see their data
(though they wouldn't crash). I first thought this could be fixed by
waiting for those snapshots to finish, but then I realized that there's
no actual place where waiting achieves anything. Certainly it's not
useful to wait before commit (because other snapshots are going to be
starting all the time), and it's not useful to start after the commit
(because by then the catalog tuple is already gone). Maybe we need two
transactions: mark partition as removed with an xmax of sorts, commit,
wait for all snapshots, start transaction, remove partition catalog
tuple, commit.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Aug 20, 2018 at 4:21 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
I wonder if this all stems from a misunderstanding of what I suggested
to David offlist. My suggestion was that the catalog scans would
continue to use the catalog MVCC snapshot, and that the relcache entries
would contain all the partitions that appear to the catalog; but each
partition's entry would carry the Xid of the creating transaction in a
field (say xpart), and that field is compared to the regular transaction
snapshot: if xpart is visible to the transaction snapshot, then the
partition is visible, otherwise not. So you never try to access a
partition that doesn't exist, because those just don't appear at all in
the relcache entry. But if you have an old transaction running with an
old snapshot, and the partitioned table just acquired a new partition,
then whether the partition will be returned as part of the partition
descriptor or not depends on the visibility of its entry.
Hmm. One question is where you're going to get the XID of the
creating transaction. If it's taken from the pg_class row or the
pg_inherits row or something of that sort, then you risk getting a
bogus value if something updates that row other than what you expect
-- and the consequences of that are pretty bad here; for this to work
as you intend, you need an exactly-correct value, not newer or older.
An alternative is to add an xid field that stores the value
explicitly, and that might work, but you'll have to arrange for that
value to be frozen at the appropriate time.
A further problem is that there could be multiple changes in quick
succession. Suppose that a partition is attached, then detached
before the attach operation is all-visible, then reattached, perhaps
with different partition bounds.
I think that works fine for ATTACH without any further changes. I'm not
so sure about DETACH, particularly when snapshots persist for a "long
time" (a repeatable-read transaction). ISTM that in the above design,
the partition descriptor would lose the entry for the detached partition
ahead of time, which means queries would silently fail to see their data
(though they wouldn't crash).
I don't see why they wouldn't crash. If the partition descriptor gets
rebuilt and some partitions disappear out from under you, the old
partition descriptor is going to get freed, and the executor has a
cached pointer to it, so it seems like you are in trouble.
I first thought this could be fixed by
waiting for those snapshots to finish, but then I realized that there's
no actual place where waiting achieves anything. Certainly it's not
useful to wait before commit (because other snapshots are going to be
starting all the time), and it's not useful to start after the commit
(because by then the catalog tuple is already gone). Maybe we need two
transactions: mark partition as removed with an xmax of sorts, commit,
wait for all snapshots, start transaction, remove partition catalog
tuple, commit.
And what would that accomplish, exactly? Waiting for all snapshots
would ensure that all still-running transactions see the fact the xmax
with which the partition has been marked as removed, but what good
does that do? In order to have a plausible algorithm, you have to
describe both what the ATTACH/DETACH operation does and what the other
concurrent transactions do and how those things interact. Otherwise,
it's like saying that we're going to solve a problem with X and Y
overlapping by having X take a lock. If Y doesn't take a conflicting
lock, this does nothing.
Generally, I think I see what you're aiming at: make ATTACH and DETACH
have MVCC-like semantics with respect to concurrent transactions. I
don't think that's a dumb idea from a theoretical perspective, but in
practice I think it's going to be very difficult to implement. We
have no other DDL that has such semantics, and there's no reason we
couldn't; for example, TRUNCATE could work with SUEL and transactions
that can't see the TRUNCATE as committed continue to operate on the
old heap. While we could do such things, we don't. If you decide to
do them here, you've probably got a lot of work ahead of you.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 21 August 2018 at 13:59, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Aug 20, 2018 at 4:21 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
I wonder if this all stems from a misunderstanding of what I suggested
to David offlist. My suggestion was that the catalog scans would
continue to use the catalog MVCC snapshot, and that the relcache entries
would contain all the partitions that appear to the catalog; but each
partition's entry would carry the Xid of the creating transaction in a
field (say xpart), and that field is compared to the regular transaction
snapshot: if xpart is visible to the transaction snapshot, then the
partition is visible, otherwise not. So you never try to access a
partition that doesn't exist, because those just don't appear at all in
the relcache entry. But if you have an old transaction running with an
old snapshot, and the partitioned table just acquired a new partition,
then whether the partition will be returned as part of the partition
descriptor or not depends on the visibility of its entry.
Hmm. One question is where you're going to get the XID of the
creating transaction. If it's taken from the pg_class row or the
pg_inherits row or something of that sort, then you risk getting a
bogus value if something updates that row other than what you expect
-- and the consequences of that are pretty bad here; for this to work
as you intend, you need an exactly-correct value, not newer or older.
An alternative is to add an xid field that stores the value
explicitly, and that might work, but you'll have to arrange for that
value to be frozen at the appropriate time.
A further problem is that there could be multiple changes in quick
succession. Suppose that a partition is attached, then detached
before the attach operation is all-visible, then reattached, perhaps
with different partition bounds.
I should probably post the WIP I have here. In those, I do have the
xmin array in the PartitionDesc. This gets taken from the new
pg_partition table, which I don't think suffers from the same issue as
taking it from pg_class, since nothing else will update the
pg_partition record.
However, I don't think the xmin array is going to work if we include
it in the PartitionDesc. The problem is, as I discovered from writing
the code was that the PartitionDesc must remain exactly the same
between planning an execution. If there are any more or any fewer
partitions found during execution than what we saw in planning then
run-time pruning will access the wrong element in the
PartitionPruneInfo array, or perhaps access off the end of the array.
It might be possible to work around that by identifying partitions by
Oid rather than PartitionDesc array index, but the run-time pruning
code is already pretty complex. I think coding it to work when the
PartitionDesc does not match between planning and execution is just
going to too difficult to get right. Tom is already unhappy with the
complexity of ExecFindInitialMatchingSubPlans().
I think the solution will require that the PartitionDesc does not:
a) Change between planning and execution.
b) Change during a snapshot after the partitioned table has been locked.
With b, it sounds like we'll need to take the most recent
PartitionDesc even if the transaction is older than the one that did
the ATTACH/DETACH operation, since if we use an old version then, as
Robert mentions, there's nothing to stop another transaction making
changes to the table that make it an incompatible partition, e.g. DROP
COLUMN. This wouldn't be possible if we update the PartitionDesc
right after taking the first lock on the partitioned table, since any
transactions doing DROP COLUMN would be blocked until the other
snapshot gets released.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hello
Here's my take on this feature, owing to David Rowley's version.
Firstly, I took Robert's advice and removed the CONCURRENTLY keyword
from the syntax. We just do it that way always. When there's a default
partition, only that partition is locked with an AEL; all the rest is
locked with ShareUpdateExclusive only.
I added some isolation tests for it -- they all pass for me.
There are two main ideas supporting this patch:
1. The Partition descriptor cache module (partcache.c) now contains a
long-lived hash table that lists all the current partition descriptors;
when an invalidation message is received for a relation, we unlink the
partdesc from the hash table *but do not free it*. The hash
table-linked partdesc is rebuilt again in the future, when requested, so
many copies might exist in memory for one partitioned table.
2. Snapshots have their own cache (hash table) of partition descriptors.
If a partdesc is requested and the snapshot has already obtained that
partdesc, the original one is returned -- we don't request a new one
from partcache.
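Idea 2 boils down to get-or-record semantics per snapshot; a toy
standalone-C sketch (a tiny linear table stands in for the real
dynahash, and all names here are invented):

```c
#include <stddef.h>

typedef struct PartDesc { int version; } PartDesc;

/* Per-snapshot cache: the first descriptor seen for a relation is the
 * one the snapshot keeps seeing, no matter how often partcache
 * rebuilds its own copy afterwards. */
typedef struct SnapCacheEnt { int relid; const PartDesc *desc; } SnapCacheEnt;
typedef struct Snapshot { SnapCacheEnt ents[8]; int nents; } Snapshot;

const PartDesc *
snapshot_get_partdesc(Snapshot *snap, int relid, const PartDesc *current)
{
    for (int i = 0; i < snap->nents; i++)
        if (snap->ents[i].relid == relid)
            return snap->ents[i].desc;  /* stable within the snapshot */

    /* First request for this relation: record the current descriptor. */
    snap->ents[snap->nents].relid = relid;
    snap->ents[snap->nents].desc = current;
    snap->nents++;
    return current;
}
```

This only works because, per idea 1, the descriptors handed out are
never freed while a snapshot may still be pointing at them.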
Then there are a few other implementation details worth mentioning:
3. parallel query: when a worker starts on a snapshot that has a
partition descriptor cache, we need to transmit those partdescs from
leader via shmem ... but we cannot send the full struct, so we just send
the OID list of partitions, then rebuild the descriptor in the worker.
Side effect: if a partition is detached right between the leader taking
the partdesc and the worker starting, the partition loses its
relpartbound column, so it's not possible to reconstruct the partdesc.
In this case, we raise an error. Hopefully this should be rare.
4. If a partitioned table is dropped, but was listed in a snapshot's
partdesc cache, and then parallel query starts, the worker will try to
restore the partdesc for that table, but there are no catalog rows for
it. The implementation choice here is to ignore the table and move on.
I would like to just remove the partdesc from the snapshot, but that
would require a relcache inval callback, and a) it'd kill us to scan all
snapshots for every relation drop; b) it doesn't work anyway because we
don't have any way to distinguish invals arriving because of DROP from
invals arriving because of anything else, say ANALYZE.
5. snapshots are copied a lot. Copies share the same hash table as the
"original", because surely all copies should see the same partition
descriptor. This leads to the pinning/unpinning business you see for
the structs in snapmgr.c.
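The pin/unpin business in item 5 is plain refcounting; a toy
standalone-C sketch of it (names invented, hash-table payload omitted):

```c
#include <stdlib.h>

/* One partdesc table shared by a snapshot and all its copies; a pin
 * count frees it only when the last copy is gone. */
typedef struct SharedPartdescTable
{
    int pins;                   /* hash table payload omitted */
} SharedPartdescTable;

SharedPartdescTable *
pdtable_create(void)
{
    SharedPartdescTable *t =
        (SharedPartdescTable *) calloc(1, sizeof(*t));
    t->pins = 1;
    return t;
}

/* Called when a snapshot is copied: the copy shares the same table. */
SharedPartdescTable *
pdtable_pin(SharedPartdescTable *t)
{
    t->pins++;
    return t;
}

/* Returns 1 if this unpin was the last one and the table was freed. */
int
pdtable_unpin(SharedPartdescTable *t)
{
    if (--t->pins == 0)
    {
        free(t);
        return 1;
    }
    return 0;
}
```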
Some known defects:
6. this still leaks memory. Not as terribly as my earlier prototypes,
but clearly it's something that I need to address.
7. I've considered the idea of tracking snapshot-partdescs in resowner.c
to prevent future memory leak mistakes. Not done yet. Closely related
to item 6.
8. Header changes may need some cleanup yet -- eg. I'm not sure
snapmgr.h compiles standalone.
9. David Rowley recently pointed out that we can modify
CREATE TABLE .. PARTITION OF to likewise not obtain AEL anymore.
Apparently it just requires removal of three lines in MergeAttributes.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
attach-concurrently.patch (text/x-diff; charset=us-ascii)
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 3c9c03c997..79571e3a38 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -3614,7 +3614,7 @@ StorePartitionBound(Relation rel, Relation parent, PartitionBoundSpec *bound)
* relcache entry for that partition every time a partition is added or
* removed.
*/
- defaultPartOid = get_default_oid_from_partdesc(RelationGetPartitionDesc(parent));
+ defaultPartOid = get_default_oid_from_partdesc(lookup_partdesc_cache(parent));
if (OidIsValid(defaultPartOid))
CacheInvalidateRelcacheByRelid(defaultPartOid);
diff --git a/src/backend/catalog/pg_constraint.c b/src/backend/catalog/pg_constraint.c
index f4057a9f15..0b7dd2c612 100644
--- a/src/backend/catalog/pg_constraint.c
+++ b/src/backend/catalog/pg_constraint.c
@@ -33,6 +33,7 @@
#include "utils/builtins.h"
#include "utils/fmgroids.h"
#include "utils/lsyscache.h"
+#include "utils/partcache.h"
#include "utils/rel.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
@@ -753,7 +754,7 @@ clone_fk_constraints(Relation pg_constraint, Relation parentRel,
if (partRel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
subclone != NIL)
{
- PartitionDesc partdesc = RelationGetPartitionDesc(partRel);
+ PartitionDesc partdesc = lookup_partdesc_cache(partRel);
int i;
for (i = 0; i < partdesc->nparts; i++)
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 906d711378..c027567c95 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -876,7 +876,7 @@ DefineIndex(Oid relationId,
*/
if (!stmt->relation || stmt->relation->inh)
{
- PartitionDesc partdesc = RelationGetPartitionDesc(rel);
+ PartitionDesc partdesc = lookup_partdesc_cache(rel);
int nparts = partdesc->nparts;
Oid *part_oids = palloc(sizeof(Oid) * nparts);
bool invalidate_parent = false;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 153aec263e..b3ec820d6e 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -830,7 +830,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
* lock the partition so as to avoid a deadlock.
*/
defaultPartOid =
- get_default_oid_from_partdesc(RelationGetPartitionDesc(parent));
+ get_default_oid_from_partdesc(lookup_partdesc_cache(parent));
if (OidIsValid(defaultPartOid))
defaultRel = heap_open(defaultPartOid, AccessExclusiveLock);
@@ -3614,9 +3614,15 @@ AlterTableGetLockLevel(List *cmds)
cmd_lockmode = AlterTableGetRelOptionsLockLevel((List *) cmd->def);
break;
+ /*
+ * Attaching and detaching partitions can be done
+ * concurrently. The default partition (if there's one) will
+ * have to be locked with AccessExclusive, but that's done
+ * elsewhere.
+ */
case AT_AttachPartition:
case AT_DetachPartition:
- cmd_lockmode = AccessExclusiveLock;
+ cmd_lockmode = ShareUpdateExclusiveLock;
break;
default: /* oops */
@@ -5903,7 +5909,7 @@ ATPrepDropNotNull(Relation rel, bool recurse, bool recursing)
*/
if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- PartitionDesc partdesc = RelationGetPartitionDesc(rel);
+ PartitionDesc partdesc = lookup_partdesc_cache(rel);
Assert(partdesc != NULL);
if (partdesc->nparts > 0 && !recurse && !recursing)
@@ -6048,7 +6054,7 @@ ATPrepSetNotNull(Relation rel, bool recurse, bool recursing)
*/
if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- PartitionDesc partdesc = RelationGetPartitionDesc(rel);
+ PartitionDesc partdesc = lookup_partdesc_cache(rel);
if (partdesc && partdesc->nparts > 0 && !recurse && !recursing)
ereport(ERROR,
@@ -7749,7 +7755,7 @@ ATAddForeignKeyConstraint(List **wqueue, AlteredTableInfo *tab, Relation rel,
{
PartitionDesc partdesc;
- partdesc = RelationGetPartitionDesc(rel);
+ partdesc = lookup_partdesc_cache(rel);
for (i = 0; i < partdesc->nparts; i++)
{
@@ -14023,7 +14029,7 @@ QueuePartitionConstraintValidation(List **wqueue, Relation scanrel,
}
else if (scanrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- PartitionDesc partdesc = RelationGetPartitionDesc(scanrel);
+ PartitionDesc partdesc = lookup_partdesc_cache(scanrel);
int i;
for (i = 0; i < partdesc->nparts; i++)
@@ -14083,10 +14089,11 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
/*
* We must lock the default partition if one exists, because attaching a
- * new partition will change its partition constraint.
+ * new partition will change its partition constraint. We must use
+ * AccessExclusiveLock here, to avoid routing any tuples to it that would
+ * belong in the newly attached partition.
*/
- defaultPartOid =
- get_default_oid_from_partdesc(RelationGetPartitionDesc(rel));
+ defaultPartOid = get_default_oid_from_partdesc(lookup_partdesc_cache(rel));
if (OidIsValid(defaultPartOid))
LockRelationOid(defaultPartOid, AccessExclusiveLock);
@@ -14673,11 +14680,12 @@ ATExecDetachPartition(Relation rel, RangeVar *name)
ListCell *cell;
/*
- * We must lock the default partition, because detaching this partition
- * will change its partition constraint.
+ * We must lock the default partition if one exists, because detaching
+ * this partition will change its partition constraint. We must use
+ * AccessExclusiveLock here, to prevent concurrent routing of tuples using
+ * the obsolete partition constraint.
*/
- defaultPartOid =
- get_default_oid_from_partdesc(RelationGetPartitionDesc(rel));
+ defaultPartOid = get_default_oid_from_partdesc(lookup_partdesc_cache(rel));
if (OidIsValid(defaultPartOid))
LockRelationOid(defaultPartOid, AccessExclusiveLock);
@@ -14911,7 +14919,7 @@ ATExecAttachPartitionIdx(List **wqueue, Relation parentIdx, RangeVar *name)
RelationGetRelationName(partIdx))));
/* Make sure it indexes a partition of the other index's table */
- partDesc = RelationGetPartitionDesc(parentTbl);
+ partDesc = lookup_partdesc_cache(parentTbl);
found = false;
for (i = 0; i < partDesc->nparts; i++)
{
@@ -15046,6 +15054,7 @@ validatePartitionedIndex(Relation partedIdx, Relation partedTbl)
int tuples = 0;
HeapTuple inhTup;
bool updated = false;
+ PartitionDesc partdesc;
Assert(partedIdx->rd_rel->relkind == RELKIND_PARTITIONED_INDEX);
@@ -15085,7 +15094,8 @@ validatePartitionedIndex(Relation partedIdx, Relation partedTbl)
* If we found as many inherited indexes as the partitioned table has
* partitions, we're good; update pg_index to set indisvalid.
*/
- if (tuples == RelationGetPartitionDesc(partedTbl)->nparts)
+ partdesc = lookup_partdesc_cache(partedTbl);
+ if (tuples == partdesc->nparts)
{
Relation idxRel;
HeapTuple newtup;
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 240e85e391..676bb2851f 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -55,6 +55,7 @@
#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
+#include "utils/partcache.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
@@ -1089,7 +1090,7 @@ CreateTrigger(CreateTrigStmt *stmt, const char *queryString,
*/
if (partition_recurse)
{
- PartitionDesc partdesc = RelationGetPartitionDesc(rel);
+ PartitionDesc partdesc = lookup_partdesc_cache(rel);
List *idxs = NIL;
List *childTbls = NIL;
ListCell *l;
@@ -1857,7 +1858,7 @@ EnableDisableTrigger(Relation rel, const char *tgname,
if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
(TRIGGER_FOR_ROW(oldtrig->tgtype)))
{
- PartitionDesc partdesc = RelationGetPartitionDesc(rel);
+ PartitionDesc partdesc = lookup_partdesc_cache(rel);
int i;
for (i = 0; i < partdesc->nparts; i++)
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 0bcb2377c3..98524ac093 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -30,6 +30,7 @@
#include "utils/rel.h"
#include "utils/rls.h"
#include "utils/ruleutils.h"
+#include "utils/snapmgr.h"
/*-----------------------
@@ -950,7 +951,6 @@ get_partition_dispatch_recurse(Relation rel, Relation parent,
List **pds, List **leaf_part_oids)
{
TupleDesc tupdesc = RelationGetDescr(rel);
- PartitionDesc partdesc = RelationGetPartitionDesc(rel);
PartitionKey partkey = RelationGetPartitionKey(rel);
PartitionDispatch pd;
int i;
@@ -963,7 +963,7 @@ get_partition_dispatch_recurse(Relation rel, Relation parent,
pd->reldesc = rel;
pd->key = partkey;
pd->keystate = NIL;
- pd->partdesc = partdesc;
+ pd->partdesc = SnapshotGetPartitionDesc(GetActiveSnapshot(), rel);
if (parent != NULL)
{
/*
@@ -1004,10 +1004,10 @@ get_partition_dispatch_recurse(Relation rel, Relation parent,
* corresponding sub-partition; otherwise, we've identified the correct
* partition.
*/
- pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
- for (i = 0; i < partdesc->nparts; i++)
+ pd->indexes = (int *) palloc(pd->partdesc->nparts * sizeof(int));
+ for (i = 0; i < pd->partdesc->nparts; i++)
{
- Oid partrelid = partdesc->oids[i];
+ Oid partrelid = pd->partdesc->oids[i];
if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
{
@@ -1515,7 +1515,8 @@ ExecCreatePartitionPruneState(PlanState *planstate,
*/
partrel = ExecGetRangeTableRelation(estate, pinfo->rtindex);
partkey = RelationGetPartitionKey(partrel);
- partdesc = RelationGetPartitionDesc(partrel);
+ partdesc = SnapshotGetPartitionDesc(GetActiveSnapshot(),
+ partrel);
n_steps = list_length(pinfo->pruning_steps);
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index d5720518a8..8602cb676a 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -49,8 +49,10 @@
#include "parser/parse_coerce.h"
#include "parser/parsetree.h"
#include "utils/lsyscache.h"
+#include "utils/partcache.h"
#include "utils/rel.h"
#include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
#include "utils/syscache.h"
@@ -1580,13 +1582,11 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
oldrelation = heap_open(parentOID, NoLock);
/* Scan the inheritance set and expand it */
- if (RelationGetPartitionDesc(oldrelation) != NULL)
+ if (rte->relkind == RELKIND_PARTITIONED_TABLE)
{
- Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);
-
/*
- * If this table has partitions, recursively expand them in the order
- * in which they appear in the PartitionDesc. While at it, also
+ * If this is a partitioned table, recursively expand the partitions in the
+ * order in which they appear in the PartitionDesc. While at it, also
* extract the partition key columns of all the partitioned tables.
*/
expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
@@ -1670,11 +1670,12 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
int i;
RangeTblEntry *childrte;
Index childRTindex;
- PartitionDesc partdesc = RelationGetPartitionDesc(parentrel);
+ PartitionDesc partdesc;
check_stack_depth();
/* A partitioned table should always have a partition descriptor. */
+ partdesc = SnapshotGetPartitionDesc(GetActiveSnapshot(), parentrel);
Assert(partdesc);
Assert(parentrte->inh);
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 46de00460d..2bf5dc8775 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -1905,7 +1905,7 @@ set_relation_partition_info(PlannerInfo *root, RelOptInfo *rel,
Assert(relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
- partdesc = RelationGetPartitionDesc(relation);
+ partdesc = SnapshotGetPartitionDesc(GetActiveSnapshot(), relation);
partkey = RelationGetPartitionKey(relation);
rel->part_scheme = find_partition_scheme(root, relation);
Assert(partdesc != NULL && rel->part_scheme != NULL);
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index c94f73aadc..725bbe1977 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -308,7 +308,7 @@ check_new_partition_bound(char *relname, Relation parent,
PartitionBoundSpec *spec)
{
PartitionKey key = RelationGetPartitionKey(parent);
- PartitionDesc partdesc = RelationGetPartitionDesc(parent);
+ PartitionDesc partdesc = lookup_partdesc_cache(parent);
PartitionBoundInfo boundinfo = partdesc->boundinfo;
ParseState *pstate = make_parsestate(NULL);
int with = -1;
@@ -1415,13 +1415,15 @@ get_qual_for_list(Relation parent, PartitionBoundSpec *spec)
/*
* For default list partition, collect datums for all the partitions. The
* default partition constraint should check that the partition key is
- * equal to none of those.
+ * equal to none of those. Using the cached version of the PartitionDesc
+ * is fine for default partitions since an AEL lock must be obtained to
+ * add partitions to a table which has a default partition.
*/
if (spec->is_default)
{
int i;
int ndatums = 0;
- PartitionDesc pdesc = RelationGetPartitionDesc(parent);
+ PartitionDesc pdesc = SnapshotGetPartitionDesc(GetActiveSnapshot(), parent);
PartitionBoundInfo boundinfo = pdesc->boundinfo;
if (boundinfo)
@@ -1621,7 +1623,7 @@ get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
if (spec->is_default)
{
List *or_expr_args = NIL;
- PartitionDesc pdesc = RelationGetPartitionDesc(parent);
+ PartitionDesc pdesc = lookup_partdesc_cache(parent);
Oid *inhoids = pdesc->oids;
int nparts = pdesc->nparts,
i;
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 908f62d37e..a4be3dea6a 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1761,6 +1761,9 @@ GetSnapshotData(Snapshot snapshot)
snapshot->regd_count = 0;
snapshot->copied = false;
+ /* this is set later, if appropriate */
+ snapshot->partdescs = NULL;
+
if (old_snapshot_threshold < 0)
{
/*
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 5757301d05..be39db7a22 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -29,20 +29,50 @@
#include "optimizer/planner.h"
#include "partitioning/partbounds.h"
#include "utils/builtins.h"
+#include "utils/catcache.h"
#include "utils/datum.h"
+#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/partcache.h"
#include "utils/rel.h"
#include "utils/syscache.h"
+/*
+ * We keep a partition descriptor cache (partcache), separate from relcache,
+ * for partitioned tables. Each entry points to a PartitionDesc struct.
+ *
+ * On relcache invalidations, the then-current partdesc for the involved
+ * relation is removed from the hash table, but not freed; instead, its
+ * containing memory context is reparented to TopTransactionContext. This
+ * way, it continues to be available for the current transaction, but newly
+ * planned queries will obtain a fresh descriptor.
+ *
+ * XXX doing it this way amounts to a transaction-long memory leak.
+ * This is not terrible, because these objects are typically a few hundred
+ * to a few thousand bytes at the most, so we can live with that.
+ *
+ * This is what we would like to do instead:
+ * Partcache entries are reference-counted and live beyond relcache
+ * invalidations, to protect callers that need to work with consistent
+ * partition descriptor entries. On relcache invalidations, the "current"
+ * partdesc for the involved relation is removed from the hash table, but not
+ * freed; a pointer to the existing entry is kept in a separate hash table,
+ * from where it is removed later when the refcount drops to zero.
+ */
+/* The partition descriptor hashtable, searched by lookup_partdesc_cache */
+static HTAB *PartCacheHash = NULL;
+
+static void PartCacheRelCallback(Datum arg, Oid relid);
+static PartitionDesc BuildPartitionDesc(Relation rel, List *oids);
static List *generate_partition_qual(Relation rel);
static int32 qsort_partition_hbound_cmp(const void *a, const void *b);
static int32 qsort_partition_list_value_cmp(const void *a, const void *b,
void *arg);
static int32 qsort_partition_rbound_cmp(const void *a, const void *b,
void *arg);
+static void create_partcache_hashtab(int nelems);
/*
@@ -59,6 +89,8 @@ static int32 qsort_partition_rbound_cmp(const void *a, const void *b,
* context the current context except in very brief code sections, out of fear
* that some of our callees allocate memory on their own which would be leaked
* permanently.
+ *
+ * XXX this function should be in relcache.c.
*/
void
RelationBuildPartitionKey(Relation relation)
@@ -251,14 +283,105 @@ RelationBuildPartitionKey(Relation relation)
}
/*
- * RelationBuildPartitionDesc
- * Form rel's partition descriptor
+ * lookup_partdesc_cache
*
- * Not flushed from the cache by RelationClearRelation() unless changed because
- * of addition or removal of partition.
+ * Fetch the partition descriptor cache entry for the specified relation.
*/
-void
-RelationBuildPartitionDesc(Relation rel)
+PartitionDesc
+lookup_partdesc_cache(Relation partedrel)
+{
+ Oid relid = RelationGetRelid(partedrel);
+ PartdescCacheEntry *entry;
+ bool found;
+
+ if (PartCacheHash == NULL)
+ {
+ /* First time through: set up hash table */
+ create_partcache_hashtab(64);
+ /* Also set up callback for SI invalidations */
+ CacheRegisterRelcacheCallback(PartCacheRelCallback, (Datum) 0);
+ }
+
+ /* If the hashtable has an entry, we're done. */
+ entry = (PartdescCacheEntry *) hash_search(PartCacheHash,
+ (void *) &relid,
+ HASH_ENTER, &found);
+ if (found)
+ return entry->partdesc;
+
+ /* None found; gotta create one */
+ entry->partdesc = BuildPartitionDesc(partedrel, NIL);
+
+ return entry->partdesc;
+}
+
+/*
+ * PartCacheRelCallback
+ * Relcache inval callback function
+ *
+ * When a relcache inval is received, we must not make the partcache entry
+ * disappear -- it may still be visible to some snapshot. Keep it around
+ * instead, but unlink it from the global hash table. We do reparent its
+ * memory context to be a child of the current transaction context, so that it
+ * goes away as soon as the current transaction finishes. No snapshot can
+ * live longer than that.
+ *
+ * On reset, we delete the entire hash table.
+ */
+static void
+PartCacheRelCallback(Datum arg, Oid relid)
+{
+ if (!OidIsValid(relid))
+ {
+ HASH_SEQ_STATUS status;
+ PartdescCacheEntry *entry;
+ int nelems = 0;
+
+ /*
+ * In case of a full relcache reset, we must reparent all entries to
+ * the current transaction context and flush the entire hash table.
+ */
+ hash_seq_init(&status, PartCacheHash);
+ while ((entry = (PartdescCacheEntry *) hash_seq_search(&status)) != NULL)
+ {
+ MemoryContextSetParent(entry->partdesc->memcxt,
+ TopTransactionContext);
+ nelems++;
+ }
+ hash_destroy(PartCacheHash);
+ create_partcache_hashtab(nelems);
+ }
+ else
+ {
+ PartdescCacheEntry *entry;
+ bool found;
+
+ /*
+ * For a single-relation inval message, search the hash table
+ * for that entry directly.
+ */
+ entry = hash_search(PartCacheHash, (void *) &relid,
+ HASH_REMOVE, &found);
+ if (found)
+ MemoryContextSetParent(entry->partdesc->memcxt,
+ TopTransactionContext);
+ }
+}
+
+/*
+ * BuildPartitionDesc
+ * Build and return the PartitionDesc for 'rel'.
+ *
+ * partrelids can be passed as a list of partitions that will be included in
+ * the descriptor; this is useful when the list of partitions is fixed in
+ * advance, for example when a parallel worker restores state from the parallel
+ * leader. If partrelids is NIL, then pg_inherits is scanned (with catalog
+ * snapshot) to determine the list of partitions.
+ *
+ * This function is supposed not to leak any memory.
+ */
+static PartitionDesc
+BuildPartitionDesc(Relation rel, List *partrelids)
{
List *inhoids,
*partoids;
@@ -270,22 +393,45 @@ RelationBuildPartitionDesc(Relation rel)
PartitionKey key = RelationGetPartitionKey(rel);
PartitionDesc result;
MemoryContext oldcxt;
-
int ndatums = 0;
int default_index = -1;
-
- /* Hash partitioning specific */
PartitionHashBound **hbounds = NULL;
-
- /* List partitioning specific */
PartitionListValue **all_values = NULL;
int null_index = -1;
-
- /* Range partitioning specific */
PartitionRangeBound **rbounds = NULL;
+ MemoryContext memcxt;
- /* Get partition oids from pg_inherits */
- inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
+ /*
+ * Each partition descriptor is contained in its own memory
+ * context. We start by creating a child of the current memory context,
+ * and then set it as child of CacheMemoryContext if everything goes well,
+ * making the partdesc permanent.
+ */
+ memcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "partition descriptor",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextCopyAndSetIdentifier(memcxt,
+ RelationGetRelationName(rel));
+ result = MemoryContextAllocZero(memcxt, sizeof(PartitionDescData));
+ result->memcxt = memcxt;
+
+ /*
+ * To guarantee no memory leaks in this function, we create a temporary
+ * memory context into which all our transient allocations go. This also
+ * enables us to run without pfree'ing anything; simply deleting this
+ * context at the end is enough.
+ */
+ memcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "partdesc temp",
+ ALLOCSET_SMALL_SIZES);
+ oldcxt = MemoryContextSwitchTo(memcxt);
+
+ /*
+ * If caller passed an OID list, use that as the partition list;
+ * otherwise obtain a fresh one from pg_inherits.
+ */
+ inhoids = partrelids != NIL ? partrelids :
+ find_inheritance_children(RelationGetRelid(rel), NoLock);
/* Collect bound spec nodes in a list */
i = 0;
@@ -302,11 +448,25 @@ RelationBuildPartitionDesc(Relation rel)
if (!HeapTupleIsValid(tuple))
elog(ERROR, "cache lookup failed for relation %u", inhrelid);
+ /*
+ * If this partition doesn't have relpartbound set, it must have been
+ * recently detached. We can't cope with that; producing a partition
+ * descriptor without it might cause a crash if used with a plan
+ * containing partition prune info. Raise an error in this case.
+ *
+ * This should only happen when an ALTER TABLE DETACH PARTITION occurs
+ * between the time the leader process of a parallel query serializes
+ * the partition descriptor and the time the workers restore it, so it
+ * should be pretty uncommon anyway.
+ */
datum = SysCacheGetAttr(RELOID, tuple,
Anum_pg_class_relpartbound,
&isnull);
if (isnull)
- elog(ERROR, "null relpartbound for relation %u", inhrelid);
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("relation %u is no longer a partition", inhrelid)));
+
boundspec = (Node *) stringToNode(TextDatumGetCString(datum));
/*
@@ -435,8 +595,8 @@ RelationBuildPartitionDesc(Relation rel)
* Collect all list values in one array. Alongside the value, we
* also save the index of partition the value comes from.
*/
- all_values = (PartitionListValue **) palloc(ndatums *
- sizeof(PartitionListValue *));
+ all_values = (PartitionListValue **)
+ palloc(ndatums * sizeof(PartitionListValue *));
i = 0;
foreach(cell, non_null_values)
{
@@ -565,15 +725,12 @@ RelationBuildPartitionDesc(Relation rel)
(int) key->strategy);
}
- /* Now build the actual relcache partition descriptor */
- rel->rd_pdcxt = AllocSetContextCreate(CacheMemoryContext,
- "partition descriptor",
- ALLOCSET_DEFAULT_SIZES);
- MemoryContextCopyAndSetIdentifier(rel->rd_pdcxt, RelationGetRelationName(rel));
+ /*
+ * Everything allocated from here on is part of the PartitionDesc, so use
+ * the descriptor's own memory context.
+ */
+ MemoryContextSwitchTo(result->memcxt);
- oldcxt = MemoryContextSwitchTo(rel->rd_pdcxt);
-
- result = (PartitionDescData *) palloc0(sizeof(PartitionDescData));
result->nparts = nparts;
if (nparts > 0)
{
@@ -591,8 +748,8 @@ RelationBuildPartitionDesc(Relation rel)
boundinfo->null_index = -1;
boundinfo->datums = (Datum **) palloc0(ndatums * sizeof(Datum *));
- /* Initialize mapping array with invalid values */
- mapping = (int *) palloc(sizeof(int) * nparts);
+ /* Initialize temporary mapping array with invalid values */
+ mapping = (int *) MemoryContextAlloc(memcxt, sizeof(int) * nparts);
for (i = 0; i < nparts; i++)
mapping[i] = -1;
@@ -628,9 +785,7 @@ RelationBuildPartitionDesc(Relation rel)
}
mapping[hbounds[i]->index] = i;
- pfree(hbounds[i]);
}
- pfree(hbounds);
break;
}
@@ -771,11 +926,96 @@ RelationBuildPartitionDesc(Relation rel)
*/
for (i = 0; i < nparts; i++)
result->oids[mapping[i]] = oids[i];
- pfree(mapping);
}
+ MemoryContextDelete(memcxt);
MemoryContextSwitchTo(oldcxt);
- rel->rd_partdesc = result;
+
+ /* Make the new entry permanent */
+ MemoryContextSetParent(result->memcxt, CacheMemoryContext);
+
+ return result;
+}
+
+/*
+ * EstimatePartCacheEntrySpace
+ * Returns the size needed to store the given partition descriptor.
+ *
+ * We are exporting only required fields from the partition descriptor.
+ */
+Size
+EstimatePartCacheEntrySpace(PartdescCacheEntry *pce)
+{
+ return sizeof(Oid) + sizeof(int) + sizeof(Oid) * pce->partdesc->nparts;
+}
+
+/*
+ * SerializePartCacheEntry
+ * Dumps the serialized partition descriptor cache entry onto the
+ * memory location at start_address. The amount of memory used is
+ * returned.
+ */
+Size
+SerializePartCacheEntry(PartdescCacheEntry *pce, char *start_address)
+{
+ Size offset;
+
+ /* copy all required fields */
+ memcpy(start_address, &pce->relid, sizeof(Oid));
+ offset = sizeof(Oid);
+ memcpy(start_address + offset, &pce->partdesc->nparts, sizeof(int));
+ offset += sizeof(int);
+ memcpy(start_address + offset, pce->partdesc->oids,
+ pce->partdesc->nparts * sizeof(Oid));
+ offset += pce->partdesc->nparts * sizeof(Oid);
+
+ return offset;
+}
+
+/*
+ * RestorePartdescCacheEntry
+ * Restore a serialized partition descriptor from the specified address.
+ * The amount of memory read is returned.
+ */
+Size
+RestorePartdescCacheEntry(PartdescCacheEntry *pce, Oid relid,
+ char *start_address)
+{
+ Size offset = 0;
+ Relation rel;
+ int nparts;
+ Oid *oids;
+ List *oidlist = NIL;
+
+ pce->relid = relid;
+
+ memcpy(&nparts, start_address, sizeof(int));
+ offset += sizeof(int);
+
+ oids = palloc(nparts * sizeof(Oid));
+ memcpy(oids, start_address + offset, nparts * sizeof(Oid));
+ offset += nparts * sizeof(Oid);
+ for (int i = 0; i < nparts; i++)
+ oidlist = lappend_oid(oidlist, oids[i]);
+
+ /*
+ * If the snapshot still contains in its cache a descriptor for a relation
+ * that was dropped, we cannot open it here anymore; ignore it. We cannot
+ * rely on the invalidation occurring at the time of relation drop,
+ * because we want to preserve entries across invalidations arriving for
+ * reasons other than drop.
+ */
+ rel = try_relation_open(pce->relid, AccessShareLock);
+ if (rel)
+ {
+ pce->partdesc = BuildPartitionDesc(rel, oidlist);
+ relation_close(rel, NoLock);
+ }
+
+ pfree(oids);
+ list_free(oidlist);
+
+ return offset;
}
/*
@@ -962,3 +1202,19 @@ qsort_partition_rbound_cmp(const void *a, const void *b, void *arg)
key->partcollation, b1->datums, b1->kind,
b1->lower, b2);
}
+
+/*
+ * Auxiliary function to create the hash table containing the partition
+ * descriptor cache.
+ */
+static void
+create_partcache_hashtab(int nelems)
+{
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(PartdescCacheEntry);
+ PartCacheHash = hash_create("Partition Descriptors", nelems, &ctl,
+ HASH_ELEM | HASH_BLOBS);
+}
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index fd3d010b77..1a16a2b7a2 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -288,8 +288,6 @@ static OpClassCacheEnt *LookupOpclassInfo(Oid operatorClassOid,
StrategyNumber numSupport);
static void RelationCacheInitFileRemoveInDir(const char *tblspcpath);
static void unlink_initfile(const char *initfilename, int elevel);
-static bool equalPartitionDescs(PartitionKey key, PartitionDesc partdesc1,
- PartitionDesc partdesc2);
/*
@@ -1003,60 +1001,6 @@ equalRSDesc(RowSecurityDesc *rsdesc1, RowSecurityDesc *rsdesc2)
}
/*
- * equalPartitionDescs
- * Compare two partition descriptors for logical equality
- */
-static bool
-equalPartitionDescs(PartitionKey key, PartitionDesc partdesc1,
- PartitionDesc partdesc2)
-{
- int i;
-
- if (partdesc1 != NULL)
- {
- if (partdesc2 == NULL)
- return false;
- if (partdesc1->nparts != partdesc2->nparts)
- return false;
-
- Assert(key != NULL || partdesc1->nparts == 0);
-
- /*
- * Same oids? If the partitioning structure did not change, that is,
- * no partitions were added or removed to the relation, the oids array
- * should still match element-by-element.
- */
- for (i = 0; i < partdesc1->nparts; i++)
- {
- if (partdesc1->oids[i] != partdesc2->oids[i])
- return false;
- }
-
- /*
- * Now compare partition bound collections. The logic to iterate over
- * the collections is private to partition.c.
- */
- if (partdesc1->boundinfo != NULL)
- {
- if (partdesc2->boundinfo == NULL)
- return false;
-
- if (!partition_bounds_equal(key->partnatts, key->parttyplen,
- key->parttypbyval,
- partdesc1->boundinfo,
- partdesc2->boundinfo))
- return false;
- }
- else if (partdesc2->boundinfo != NULL)
- return false;
- }
- else if (partdesc2 != NULL)
- return false;
-
- return true;
-}
-
-/*
* RelationBuildDesc
*
* Build a relation descriptor. The caller must hold at least
@@ -1184,18 +1128,13 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
relation->rd_fkeylist = NIL;
relation->rd_fkeyvalid = false;
- /* if a partitioned table, initialize key and partition descriptor info */
+ /* if a partitioned table, initialize key info */
if (relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
- {
RelationBuildPartitionKey(relation);
- RelationBuildPartitionDesc(relation);
- }
else
{
relation->rd_partkeycxt = NULL;
relation->rd_partkey = NULL;
- relation->rd_partdesc = NULL;
- relation->rd_pdcxt = NULL;
}
/*
@@ -2284,8 +2223,6 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
MemoryContextDelete(relation->rd_rsdesc->rscxt);
if (relation->rd_partkeycxt)
MemoryContextDelete(relation->rd_partkeycxt);
- if (relation->rd_pdcxt)
- MemoryContextDelete(relation->rd_pdcxt);
if (relation->rd_partcheck)
pfree(relation->rd_partcheck);
if (relation->rd_fdwroutine)
@@ -2440,7 +2377,6 @@ RelationClearRelation(Relation relation, bool rebuild)
bool keep_rules;
bool keep_policies;
bool keep_partkey;
- bool keep_partdesc;
/* Build temporary entry, but don't link it into hashtable */
newrel = RelationBuildDesc(save_relid, false);
@@ -2473,9 +2409,6 @@ RelationClearRelation(Relation relation, bool rebuild)
keep_policies = equalRSDesc(relation->rd_rsdesc, newrel->rd_rsdesc);
/* partkey is immutable once set up, so we can always keep it */
keep_partkey = (relation->rd_partkey != NULL);
- keep_partdesc = equalPartitionDescs(relation->rd_partkey,
- relation->rd_partdesc,
- newrel->rd_partdesc);
/*
* Perform swapping of the relcache entry contents. Within this
@@ -2536,11 +2469,6 @@ RelationClearRelation(Relation relation, bool rebuild)
SWAPFIELD(PartitionKey, rd_partkey);
SWAPFIELD(MemoryContext, rd_partkeycxt);
}
- if (keep_partdesc)
- {
- SWAPFIELD(PartitionDesc, rd_partdesc);
- SWAPFIELD(MemoryContext, rd_pdcxt);
- }
#undef SWAPFIELD
@@ -3776,7 +3704,7 @@ RelationCacheInitializePhase3(void)
}
/*
- * Reload the partition key and descriptor for a partitioned table.
+ * Reload the partition key for a partitioned table.
*/
if (relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
relation->rd_partkey == NULL)
@@ -3787,15 +3715,6 @@ RelationCacheInitializePhase3(void)
restart = true;
}
- if (relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
- relation->rd_partdesc == NULL)
- {
- RelationBuildPartitionDesc(relation);
- Assert(relation->rd_partdesc != NULL);
-
- restart = true;
- }
-
/* Release hold on the relation */
RelationDecrementReferenceCount(relation);
@@ -5652,8 +5571,6 @@ load_relcache_init_file(bool shared)
rel->rd_rsdesc = NULL;
rel->rd_partkeycxt = NULL;
rel->rd_partkey = NULL;
- rel->rd_pdcxt = NULL;
- rel->rd_partdesc = NULL;
rel->rd_partcheck = NIL;
rel->rd_indexprs = NIL;
rel->rd_indpred = NIL;
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index edf59efc29..72a82f05cc 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -64,6 +64,7 @@
#include "utils/memutils.h"
#include "utils/rel.h"
#include "utils/resowner_private.h"
+#include "utils/partcache.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
@@ -128,6 +129,11 @@ typedef struct OldSnapshotControlData
static volatile OldSnapshotControlData *oldSnapshotControl;
+typedef struct SnapshotPartitionDescriptors
+{
+ HTAB *hashtab;
+ int refcount;
+} SnapshotPartitionDescriptors;
/*
* CurrentSnapshot points to the only snapshot taken in transaction-snapshot
@@ -228,6 +234,11 @@ static Snapshot CopySnapshot(Snapshot snapshot);
static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
+static void init_partition_descriptors(Snapshot snapshot);
+static void create_partdesc_hashtab(Snapshot snapshot);
+static void pin_partition_descriptors(Snapshot snapshot);
+static void unpin_partition_descriptors(Snapshot snapshot);
+
/*
* Snapshot fields to be serialized.
*
@@ -245,6 +256,7 @@ typedef struct SerializedSnapshotData
CommandId curcid;
TimestampTz whenTaken;
XLogRecPtr lsn;
+ int npartdescs; /* XXX move? */
} SerializedSnapshotData;
Size
@@ -355,6 +367,9 @@ GetTransactionSnapshot(void)
else
CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
+ init_partition_descriptors(CurrentSnapshot);
+ pin_partition_descriptors(CurrentSnapshot);
+
FirstSnapshotSet = true;
return CurrentSnapshot;
}
@@ -367,6 +382,8 @@ GetTransactionSnapshot(void)
CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
+ init_partition_descriptors(CurrentSnapshot);
+
return CurrentSnapshot;
}
@@ -396,7 +413,16 @@ GetLatestSnapshot(void)
if (!FirstSnapshotSet)
return GetTransactionSnapshot();
+ /*
+ * If we have a partition descriptor cache from a previous iteration,
+ * clean it up
+ */
+ if (SecondarySnapshot)
+ unpin_partition_descriptors(SecondarySnapshot);
+
SecondarySnapshot = GetSnapshotData(&SecondarySnapshotData);
+ init_partition_descriptors(SecondarySnapshot);
+ pin_partition_descriptors(SecondarySnapshot);
return SecondarySnapshot;
}
@@ -678,6 +704,13 @@ CopySnapshot(Snapshot snapshot)
newsnap->active_count = 0;
newsnap->copied = true;
+ /*
+ * All copies of a snapshot share the same partition descriptor cache; we
+ * must not free it until all references to it are gone. Caller must see
+ * to it that the descriptor is pinned!
+ */
+ newsnap->partdescs = snapshot->partdescs;
+
/* setup XID array */
if (snapshot->xcnt > 0)
{
@@ -718,6 +751,9 @@ FreeSnapshot(Snapshot snapshot)
Assert(snapshot->active_count == 0);
Assert(snapshot->copied);
+ if (snapshot->partdescs)
+ unpin_partition_descriptors(snapshot);
+
pfree(snapshot);
}
@@ -747,6 +783,8 @@ PushActiveSnapshot(Snapshot snap)
else
newactive->as_snap = snap;
+ pin_partition_descriptors(newactive->as_snap);
+
newactive->as_next = ActiveSnapshot;
newactive->as_level = GetCurrentTransactionNestLevel();
@@ -891,6 +929,8 @@ RegisterSnapshotOnOwner(Snapshot snapshot, ResourceOwner owner)
if (snap->regd_count == 1)
pairingheap_add(&RegisteredSnapshots, &snap->ph_node);
+ pin_partition_descriptors(snap);
+
return snap;
}
@@ -1075,6 +1115,8 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
}
FirstXactSnapshot = NULL;
+ /* FIXME what do we need for partdescs here?? */
+
/*
* If we exported any snapshots, clean them up.
*/
@@ -2056,6 +2098,20 @@ EstimateSnapshotSpace(Snapshot snap)
size = add_size(size,
mul_size(snap->subxcnt, sizeof(TransactionId)));
+ if (snap->partdescs && snap->partdescs->hashtab)
+ {
+ HASH_SEQ_STATUS status;
+ void *entry;
+
+ size = add_size(size, sizeof(int));
+
+ hash_seq_init(&status, snap->partdescs->hashtab);
+ while ((entry = hash_seq_search(&status)) != NULL)
+ {
+ size = add_size(size, EstimatePartCacheEntrySpace(entry));
+ }
+ }
+
return size;
}
@@ -2068,9 +2124,21 @@ void
SerializeSnapshot(Snapshot snapshot, char *start_address)
{
SerializedSnapshotData serialized_snapshot;
+ int numpartdescs = 0;
Assert(snapshot->subxcnt >= 0);
+ /* Count entries in local partition descriptor cache, if there's one */
+ if (snapshot->partdescs && snapshot->partdescs->hashtab)
+ {
+ HASH_SEQ_STATUS status;
+ void *entry;
+
+ hash_seq_init(&status, snapshot->partdescs->hashtab);
+ while ((entry = hash_seq_search(&status)) != NULL)
+ numpartdescs++;
+ }
+
/* Copy all required fields */
serialized_snapshot.xmin = snapshot->xmin;
serialized_snapshot.xmax = snapshot->xmax;
@@ -2081,6 +2149,7 @@ SerializeSnapshot(Snapshot snapshot, char *start_address)
serialized_snapshot.curcid = snapshot->curcid;
serialized_snapshot.whenTaken = snapshot->whenTaken;
serialized_snapshot.lsn = snapshot->lsn;
+ serialized_snapshot.npartdescs = numpartdescs;
/*
* Ignore the SubXID array if it has overflowed, unless the snapshot was
@@ -2114,6 +2183,25 @@ SerializeSnapshot(Snapshot snapshot, char *start_address)
memcpy((TransactionId *) (start_address + subxipoff),
snapshot->subxip, snapshot->subxcnt * sizeof(TransactionId));
}
+
+ /* Serialize each cached partition descriptor. */
+ if (numpartdescs > 0)
+ {
+ HASH_SEQ_STATUS status;
+ Size partdescoff;
+ void *entry;
+
+ partdescoff = sizeof(SerializedSnapshotData) +
+ snapshot->xcnt * sizeof(TransactionId) +
+ serialized_snapshot.subxcnt * sizeof(TransactionId);
+
+ hash_seq_init(&status, snapshot->partdescs->hashtab);
+ while ((entry = hash_seq_search(&status)) != NULL)
+ {
+ partdescoff +=
+ SerializePartCacheEntry(entry, start_address + partdescoff);
+ }
+ }
}
/*
@@ -2156,6 +2244,8 @@ RestoreSnapshot(char *start_address)
snapshot->whenTaken = serialized_snapshot.whenTaken;
snapshot->lsn = serialized_snapshot.lsn;
+ snapshot->partdescs = NULL;
+
/* Copy XIDs, if present. */
if (serialized_snapshot.xcnt > 0)
{
@@ -2173,6 +2263,31 @@ RestoreSnapshot(char *start_address)
serialized_snapshot.subxcnt * sizeof(TransactionId));
}
+ if (serialized_snapshot.npartdescs > 0)
+ {
+ char *address = start_address + sizeof(SerializedSnapshotData) +
+ serialized_snapshot.xcnt * sizeof(TransactionId) +
+ serialized_snapshot.subxcnt * sizeof(TransactionId);
+
+ init_partition_descriptors(snapshot);
+ create_partdesc_hashtab(snapshot);
+ pin_partition_descriptors(snapshot); /* XXX is this needed? */
+
+ for (int i = 0; i < serialized_snapshot.npartdescs; i++)
+ {
+ Oid relid;
+ PartdescCacheEntry *entry;
+
+ memcpy(&relid, address, sizeof(Oid));
+ address += sizeof(Oid);
+
+ entry = hash_search(snapshot->partdescs->hashtab, &relid,
+ HASH_ENTER, NULL);
+
+ address += RestorePartdescCacheEntry(entry, relid, address);
+ }
+ }
+
/* Set the copied flag so that the caller will set refcounts correctly. */
snapshot->regd_count = 0;
snapshot->active_count = 0;
@@ -2192,3 +2307,103 @@ RestoreTransactionSnapshot(Snapshot snapshot, void *master_pgproc)
{
SetTransactionSnapshot(snapshot, NULL, InvalidPid, master_pgproc);
}
+
+/*---------------------------------------------------------------------
+ * Partition descriptor cache support
+ *---------------------------------------------------------------------
+ */
+
+/*
+ * SnapshotGetPartitionDesc
+ * Return a partition descriptor valid for the given snapshot.
+ *
+ * If the partition descriptor has already been cached for this snapshot,
+ * return that; otherwise, partcache.c does the actual work.
+ */
+PartitionDesc
+SnapshotGetPartitionDesc(Snapshot snapshot, Relation rel)
+{
+ PartdescCacheEntry *entry;
+ Oid relid = RelationGetRelid(rel);
+ bool found;
+
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+
+ /* Initialize hash table on first call */
+ if (snapshot->partdescs->hashtab == NULL)
+ create_partdesc_hashtab(snapshot);
+
+ /* Search hash table, initializing new entry if not found */
+ entry = hash_search(snapshot->partdescs->hashtab, &relid,
+ HASH_ENTER, &found);
+ if (!found)
+ entry->partdesc = lookup_partdesc_cache(rel);
+
+ return entry->partdesc;
+}
+
+/*
+ * Initialize the partition descriptor struct for this snapshot.
+ */
+static void
+init_partition_descriptors(Snapshot snapshot)
+{
+ SnapshotPartitionDescriptors *descs;
+
+ descs = MemoryContextAlloc(TopTransactionContext,
+ sizeof(SnapshotPartitionDescriptors));
+ descs->hashtab = NULL;
+ descs->refcount = 0;
+
+ snapshot->partdescs = descs;
+}
+
+/*
+ * Create the hashtable for the partition descriptor cache of this snapshot.
+ *
+ * We do this separately from initializing, to delay until the hashtable is
+ * really needed. Many snapshots will never access a partitioned table.
+ */
+static void
+create_partdesc_hashtab(Snapshot snapshot)
+{
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(PartdescCacheEntry);
+ ctl.hcxt = TopTransactionContext;
+ snapshot->partdescs->hashtab =
+ hash_create("Snapshot Partdescs", 10, &ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+}
+
+/*
+ * Increment pin count for this snapshot's partition descriptor.
+ */
+static void
+pin_partition_descriptors(Snapshot snapshot)
+{
+ if (snapshot->partdescs)
+ snapshot->partdescs->refcount++;
+}
+
+/*
+ * Decrement pin count for this snapshot's partition descriptor.
+ *
+ * If this was the last snapshot using this partition descriptor, free it.
+ */
+static void
+unpin_partition_descriptors(Snapshot snapshot)
+{
+ /* Quick exit for snapshots without partition descriptors */
+ if (!snapshot->partdescs)
+ return;
+
+ if (--snapshot->partdescs->refcount <= 0)
+ {
+ hash_destroy(snapshot->partdescs->hashtab);
+ pfree(snapshot->partdescs);
+ snapshot->partdescs = NULL;
+ }
+}
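The pin/unpin pair above implements a simple shared refcount: every snapshot copy referencing the cache takes a pin, and only the last unpin destroys it. A minimal standalone sketch of that lifecycle (the struct and function names here are illustrative stand-ins, not the patch's actual symbols):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for snapmgr.c's SnapshotPartitionDescriptors: several
 * snapshot copies share one cache; the last unpin frees it. */
typedef struct PartdescCache
{
	int			refcount;
} PartdescCache;

static PartdescCache *
cache_create(void)
{
	PartdescCache *c = malloc(sizeof(PartdescCache));

	c->refcount = 0;
	return c;
}

static void
cache_pin(PartdescCache *c)
{
	if (c)
		c->refcount++;
}

/* Returns 1 if this unpin released the cache, 0 otherwise. */
static int
cache_unpin(PartdescCache **cp)
{
	PartdescCache *c = *cp;

	if (!c)
		return 0;
	if (--c->refcount <= 0)
	{
		free(c);
		*cp = NULL;
		return 1;
	}
	return 0;
}
```

This mirrors why CopySnapshot only copies the pointer: all copies must agree on one refcount, so the cache outlives any individual snapshot copy.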
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index a53de2372e..74904d6285 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -19,15 +19,6 @@
/* Seed for the extended hash function */
#define HASH_PARTITION_SEED UINT64CONST(0x7A5B22367996DCFD)
-/*
- * Information about partitions of a partitioned table.
- */
-typedef struct PartitionDescData
-{
- int nparts; /* Number of partitions */
- Oid *oids; /* OIDs of partitions */
- PartitionBoundInfo boundinfo; /* collection of partition bounds */
-} PartitionDescData;
extern Oid get_partition_parent(Oid relid);
extern List *get_partition_ancestors(Oid relid);
diff --git a/src/include/utils/partcache.h b/src/include/utils/partcache.h
index 873c60fafd..b48bcd33fe 100644
--- a/src/include/utils/partcache.h
+++ b/src/include/utils/partcache.h
@@ -46,11 +46,35 @@ typedef struct PartitionKeyData
Oid *parttypcoll;
} PartitionKeyData;
+/*
+ * Information about partitions of a partitioned table.
+ */
+typedef struct PartitionDescData
+{
+ Oid relid; /* hash key -- must be first */
+ int nparts; /* Number of partitions */
+ Oid *oids; /* OIDs of partitions */
+ PartitionBoundInfo boundinfo; /* collection of partition bounds */
+ MemoryContext memcxt; /* memory context containing this entry */
+} PartitionDescData;
+
+typedef struct PartdescCacheEntry
+{
+ Oid relid;
+ PartitionDesc partdesc;
+} PartdescCacheEntry;
+
+extern PartitionDesc lookup_partdesc_cache(Relation partedrel);
+
extern void RelationBuildPartitionKey(Relation relation);
-extern void RelationBuildPartitionDesc(Relation rel);
extern List *RelationGetPartitionQual(Relation rel);
extern Expr *get_partition_qual_relid(Oid relid);
+extern Size EstimatePartCacheEntrySpace(PartdescCacheEntry *pce);
+extern Size SerializePartCacheEntry(PartdescCacheEntry *pce, char *start_address);
+extern Size RestorePartdescCacheEntry(PartdescCacheEntry *pce, Oid relid,
+ char *start_address);
+
/*
* PartitionKey inquiry functions
*/
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 84469f5715..97cfd0f4d0 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -97,8 +97,6 @@ typedef struct RelationData
MemoryContext rd_partkeycxt; /* private memory cxt for the below */
struct PartitionKeyData *rd_partkey; /* partition key, or NULL */
- MemoryContext rd_pdcxt; /* private context for partdesc */
- struct PartitionDescData *rd_partdesc; /* partitions, or NULL */
List *rd_partcheck; /* partition CHECK quals */
/* data managed by RelationGetIndexList: */
@@ -589,12 +587,6 @@ typedef struct ViewOptions
*/
#define RelationGetPartitionKey(relation) ((relation)->rd_partkey)
-/*
- * RelationGetPartitionDesc
- * Returns partition descriptor for a relation.
- */
-#define RelationGetPartitionDesc(relation) ((relation)->rd_partdesc)
-
/* routines in utils/cache/relcache.c */
extern void RelationIncrementReferenceCount(Relation rel);
extern void RelationDecrementReferenceCount(Relation rel);
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 83806f3040..443c4532c4 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -14,6 +14,7 @@
#define SNAPMGR_H
#include "fmgr.h"
+#include "partitioning/partdefs.h"
#include "utils/relcache.h"
#include "utils/resowner.h"
#include "utils/snapshot.h"
@@ -83,6 +84,8 @@ extern void UnregisterSnapshot(Snapshot snapshot);
extern Snapshot RegisterSnapshotOnOwner(Snapshot snapshot, ResourceOwner owner);
extern void UnregisterSnapshotFromOwner(Snapshot snapshot, ResourceOwner owner);
+extern PartitionDesc SnapshotGetPartitionDesc(Snapshot snapshot, Relation rel);
+
extern void AtSubCommit_Snapshot(int level);
extern void AtSubAbort_Snapshot(int level);
extern void AtEOXact_Snapshot(bool isCommit, bool resetXmin);
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index a8a5a8f4c0..d99bacd8b6 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -34,6 +34,12 @@ typedef bool (*SnapshotSatisfiesFunc) (HeapTuple htup,
Snapshot snapshot, Buffer buffer);
/*
+ * Partition descriptors cached by the snapshot. Opaque to outside callers;
+ * use SnapshotGetPartitionDesc().
+ */
+struct SnapshotPartitionDescriptors;
+
+/*
* Struct representing all kind of possible snapshots.
*
* There are several different kinds of snapshots:
@@ -103,6 +109,9 @@ typedef struct SnapshotData
*/
uint32 speculativeToken;
+ /* cached partitioned table descriptors */
+ struct SnapshotPartitionDescriptors *partdescs;
+
/*
* Book-keeping information, used by the snapshot manager
*/
diff --git a/src/test/isolation/expected/attach-partition-1.out b/src/test/isolation/expected/attach-partition-1.out
new file mode 100644
index 0000000000..3a5a5b6422
--- /dev/null
+++ b/src/test/isolation/expected/attach-partition-1.out
@@ -0,0 +1,31 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s1b s1s s2a s1s s3b s3s s1c s1s s3s s3c
+step s1b: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1s: SELECT * FROM listp;
+a
+
+1
+step s2a: ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2);
+step s1s: SELECT * FROM listp;
+a
+
+1
+step s3b: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s3s: SELECT * FROM listp;
+a
+
+1
+2
+step s1c: COMMIT;
+step s1s: SELECT * FROM listp;
+a
+
+1
+2
+step s3s: SELECT * FROM listp;
+a
+
+1
+2
+step s3c: COMMIT;
diff --git a/src/test/isolation/expected/attach-partition-2.out b/src/test/isolation/expected/attach-partition-2.out
new file mode 100644
index 0000000000..c4090ceb0d
--- /dev/null
+++ b/src/test/isolation/expected/attach-partition-2.out
@@ -0,0 +1,238 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1brc s1prep s1exec s2a s1exec s1c s1exec
+step s1brc: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s1prep: PREPARE f AS SELECT * FROM listp ;
+step s1exec: EXECUTE f;
+a
+
+1
+step s2a: ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2);
+step s1exec: EXECUTE f;
+a
+
+1
+step s1c: COMMIT;
+step s1exec: EXECUTE f;
+a
+
+1
+2
+
+starting permutation: s1brc s1prep s1exec s2a s1dummy s1exec s1c s1exec
+step s1brc: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s1prep: PREPARE f AS SELECT * FROM listp ;
+step s1exec: EXECUTE f;
+a
+
+1
+step s2a: ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2);
+step s1dummy: SELECT 1;
+?column?
+
+1
+step s1exec: EXECUTE f;
+a
+
+1
+step s1c: COMMIT;
+step s1exec: EXECUTE f;
+a
+
+1
+2
+
+starting permutation: s1brc s1prep s1exec s2a s1dummy2 s1exec s1c s1exec
+step s1brc: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s1prep: PREPARE f AS SELECT * FROM listp ;
+step s1exec: EXECUTE f;
+a
+
+1
+step s2a: ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2);
+step s1dummy2: SELECT 1 + 1;
+?column?
+
+2
+step s1exec: EXECUTE f;
+a
+
+1
+2
+step s1c: COMMIT;
+step s1exec: EXECUTE f;
+a
+
+1
+2
+
+starting permutation: s1brc s1prep s1exec s2a s1ins s1exec s1c s1exec
+step s1brc: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s1prep: PREPARE f AS SELECT * FROM listp ;
+step s1exec: EXECUTE f;
+a
+
+1
+step s2a: ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2);
+step s1ins: INSERT INTO listp VALUES (1);
+step s1exec: EXECUTE f;
+a
+
+1
+1
+2
+step s1c: COMMIT;
+step s1exec: EXECUTE f;
+a
+
+1
+1
+2
+
+starting permutation: s1brr s1prep s1exec s2a s1exec s1c s1exec
+step s1brr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1prep: PREPARE f AS SELECT * FROM listp ;
+step s1exec: EXECUTE f;
+a
+
+1
+step s2a: ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2);
+step s1exec: EXECUTE f;
+a
+
+1
+step s1c: COMMIT;
+step s1exec: EXECUTE f;
+a
+
+1
+2
+
+starting permutation: s1brr s1prep s1exec s2a s1dummy s1exec s1c s1exec
+step s1brr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1prep: PREPARE f AS SELECT * FROM listp ;
+step s1exec: EXECUTE f;
+a
+
+1
+step s2a: ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2);
+step s1dummy: SELECT 1;
+?column?
+
+1
+step s1exec: EXECUTE f;
+a
+
+1
+step s1c: COMMIT;
+step s1exec: EXECUTE f;
+a
+
+1
+2
+
+starting permutation: s1brr s1prep s1exec s2a s1dummy2 s1exec s1c s1exec
+step s1brr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1prep: PREPARE f AS SELECT * FROM listp ;
+step s1exec: EXECUTE f;
+a
+
+1
+step s2a: ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2);
+step s1dummy2: SELECT 1 + 1;
+?column?
+
+2
+step s1exec: EXECUTE f;
+a
+
+1
+step s1c: COMMIT;
+step s1exec: EXECUTE f;
+a
+
+1
+2
+
+starting permutation: s1brr s1prep s1exec s2a s1ins s1exec s1c s1exec
+step s1brr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1prep: PREPARE f AS SELECT * FROM listp ;
+step s1exec: EXECUTE f;
+a
+
+1
+step s2a: ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2);
+step s1ins: INSERT INTO listp VALUES (1);
+step s1exec: EXECUTE f;
+a
+
+1
+1
+step s1c: COMMIT;
+step s1exec: EXECUTE f;
+a
+
+1
+1
+
+starting permutation: s1prep s1exec s2a s1exec
+step s1prep: PREPARE f AS SELECT * FROM listp ;
+step s1exec: EXECUTE f;
+a
+
+1
+step s2a: ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2);
+step s1exec: EXECUTE f;
+a
+
+1
+2
+
+starting permutation: s1prep s1exec s2a s1dummy s1exec
+step s1prep: PREPARE f AS SELECT * FROM listp ;
+step s1exec: EXECUTE f;
+a
+
+1
+step s2a: ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2);
+step s1dummy: SELECT 1;
+?column?
+
+1
+step s1exec: EXECUTE f;
+a
+
+1
+2
+
+starting permutation: s1prep s1exec s2a s1dummy2 s1exec
+step s1prep: PREPARE f AS SELECT * FROM listp ;
+step s1exec: EXECUTE f;
+a
+
+1
+step s2a: ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2);
+step s1dummy2: SELECT 1 + 1;
+?column?
+
+2
+step s1exec: EXECUTE f;
+a
+
+1
+2
+
+starting permutation: s1prep s1exec s2a s1ins s1exec
+step s1prep: PREPARE f AS SELECT * FROM listp ;
+step s1exec: EXECUTE f;
+a
+
+1
+step s2a: ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2);
+step s1ins: INSERT INTO listp VALUES (1);
+step s1exec: EXECUTE f;
+a
+
+1
+1
+2
diff --git a/src/test/isolation/expected/detach-partition-1.out b/src/test/isolation/expected/detach-partition-1.out
new file mode 100644
index 0000000000..b14d9f1018
--- /dev/null
+++ b/src/test/isolation/expected/detach-partition-1.out
@@ -0,0 +1,42 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1brr s1s s2d s1s s2drop s1c s1s
+step s1brr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1s: SELECT * FROM d_listp;
+a
+
+1
+2
+step s2d: ALTER TABLE d_listp DETACH PARTITION d_listp2;
+step s1s: SELECT * FROM d_listp;
+a
+
+1
+2
+step s2drop: DROP TABLE d_listp2; <waiting ...>
+step s1c: COMMIT;
+step s2drop: <... completed>
+step s1s: SELECT * FROM d_listp;
+a
+
+1
+
+starting permutation: s1brc s1s s2d s1s s2drop s1c s1s
+step s1brc: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s1s: SELECT * FROM d_listp;
+a
+
+1
+2
+step s2d: ALTER TABLE d_listp DETACH PARTITION d_listp2;
+step s1s: SELECT * FROM d_listp;
+a
+
+1
+step s2drop: DROP TABLE d_listp2; <waiting ...>
+step s1c: COMMIT;
+step s2drop: <... completed>
+step s1s: SELECT * FROM d_listp;
+a
+
+1
diff --git a/src/test/isolation/expected/detach-partition-2.out b/src/test/isolation/expected/detach-partition-2.out
new file mode 100644
index 0000000000..8c1e828c5f
--- /dev/null
+++ b/src/test/isolation/expected/detach-partition-2.out
@@ -0,0 +1,37 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1brr s1dec s1fetch s2d s1fetch s2drop s1c
+step s1brr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1dec: DECLARE f NO SCROLL CURSOR FOR SELECT * FROM d_listp;
+step s1fetch: FETCH ALL FROM f; MOVE ABSOLUTE 0 f;
+a
+
+1
+2
+step s2d: ALTER TABLE d_listp DETACH PARTITION d_listp2;
+step s1fetch: FETCH ALL FROM f; MOVE ABSOLUTE 0 f;
+a
+
+1
+2
+step s2drop: DROP TABLE d_listp2; <waiting ...>
+step s1c: COMMIT;
+step s2drop: <... completed>
+
+starting permutation: s1brc s1dec s1fetch s2d s1fetch s2drop s1c
+step s1brc: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s1dec: DECLARE f NO SCROLL CURSOR FOR SELECT * FROM d_listp;
+step s1fetch: FETCH ALL FROM f; MOVE ABSOLUTE 0 f;
+a
+
+1
+2
+step s2d: ALTER TABLE d_listp DETACH PARTITION d_listp2;
+step s1fetch: FETCH ALL FROM f; MOVE ABSOLUTE 0 f;
+a
+
+1
+2
+step s2drop: DROP TABLE d_listp2; <waiting ...>
+step s1c: COMMIT;
+step s2drop: <... completed>
diff --git a/src/test/isolation/expected/detach-partition-3.out b/src/test/isolation/expected/detach-partition-3.out
new file mode 100644
index 0000000000..cb775b8f97
--- /dev/null
+++ b/src/test/isolation/expected/detach-partition-3.out
@@ -0,0 +1,45 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1brr s1prep s1exec s2d s1exec s2drop s1c s1exec
+step s1brr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1prep: PREPARE f AS SELECT * FROM dp_listp;
+step s1exec: EXECUTE f;
+a
+
+1
+2
+step s2d: ALTER TABLE dp_listp DETACH PARTITION dp_listp2;
+step s1exec: EXECUTE f;
+a
+
+1
+2
+step s2drop: DROP TABLE dp_listp2; <waiting ...>
+step s1c: COMMIT;
+step s2drop: <... completed>
+step s1exec: EXECUTE f;
+a
+
+1
+
+starting permutation: s1brc s1prep s1exec s2d s1exec s2drop s1c s1exec
+step s1brc: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s1prep: PREPARE f AS SELECT * FROM dp_listp;
+step s1exec: EXECUTE f;
+a
+
+1
+2
+step s2d: ALTER TABLE dp_listp DETACH PARTITION dp_listp2;
+step s1exec: EXECUTE f;
+a
+
+1
+2
+step s2drop: DROP TABLE dp_listp2; <waiting ...>
+step s1c: COMMIT;
+step s2drop: <... completed>
+step s1exec: EXECUTE f;
+a
+
+1
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index dd57a96e78..da72d31507 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -77,5 +77,10 @@ test: partition-key-update-1
test: partition-key-update-2
test: partition-key-update-3
test: partition-key-update-4
+test: attach-partition-1
+test: attach-partition-2
+test: detach-partition-1
+test: detach-partition-2
+test: detach-partition-3
test: plpgsql-toast
test: truncate-conflict
diff --git a/src/test/isolation/specs/attach-partition-1.spec b/src/test/isolation/specs/attach-partition-1.spec
new file mode 100644
index 0000000000..4d8af76d92
--- /dev/null
+++ b/src/test/isolation/specs/attach-partition-1.spec
@@ -0,0 +1,30 @@
+# Test that attach partition concurrently makes the partition visible at the
+# correct time.
+
+setup
+{
+ CREATE TABLE listp (a int) PARTITION BY LIST(a);
+ CREATE TABLE listp1 PARTITION OF listp FOR VALUES IN (1);
+ CREATE TABLE listp2 (a int);
+ INSERT INTO listp1 VALUES (1);
+ INSERT INTO listp2 VALUES (2);
+}
+
+teardown { DROP TABLE listp; }
+
+session "s1"
+step "s1b" { BEGIN ISOLATION LEVEL REPEATABLE READ; }
+step "s1s" { SELECT * FROM listp; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2a" { ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2); }
+
+session "s3"
+step "s3b" { BEGIN ISOLATION LEVEL REPEATABLE READ; }
+step "s3s" { SELECT * FROM listp; }
+step "s3c" { COMMIT; }
+
+# listp2's row should not be visible to s1 until transaction commit.
+# session 3 should see listp2's row with both SELECTs it performs.
+permutation "s1b" "s1s" "s2a" "s1s" "s3b" "s3s" "s1c" "s1s" "s3s" "s3c"
diff --git a/src/test/isolation/specs/attach-partition-2.spec b/src/test/isolation/specs/attach-partition-2.spec
new file mode 100644
index 0000000000..c6a7de8801
--- /dev/null
+++ b/src/test/isolation/specs/attach-partition-2.spec
@@ -0,0 +1,42 @@
+setup
+{
+ CREATE TABLE listp (a int) PARTITION BY LIST(a);
+ CREATE TABLE listp1 PARTITION OF listp FOR VALUES IN (1);
+ CREATE TABLE listp2 (a int);
+ INSERT INTO listp1 VALUES (1);
+ INSERT INTO listp2 VALUES (2);
+}
+
+teardown { DROP TABLE listp; }
+
+session "s1"
+step "s1brc" { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1brr" { BEGIN ISOLATION LEVEL REPEATABLE READ; }
+step "s1prep" { PREPARE f AS SELECT * FROM listp ; }
+step "s1exec" { EXECUTE f; }
+step "s1ins" { INSERT INTO listp VALUES (1); }
+step "s1dummy" { SELECT 1; }
+step "s1dummy2" { SELECT 1 + 1; }
+step "s1c" { COMMIT; }
+teardown { DEALLOCATE f; }
+
+session "s2"
+step "s2a" { ALTER TABLE listp ATTACH PARTITION listp2 FOR VALUES IN (2); }
+
+# read committed
+permutation "s1brc" "s1prep" "s1exec" "s2a" "s1exec" "s1c" "s1exec"
+permutation "s1brc" "s1prep" "s1exec" "s2a" "s1dummy" "s1exec" "s1c" "s1exec"
+permutation "s1brc" "s1prep" "s1exec" "s2a" "s1dummy2" "s1exec" "s1c" "s1exec"
+permutation "s1brc" "s1prep" "s1exec" "s2a" "s1ins" "s1exec" "s1c" "s1exec"
+
+# repeatable read
+permutation "s1brr" "s1prep" "s1exec" "s2a" "s1exec" "s1c" "s1exec"
+permutation "s1brr" "s1prep" "s1exec" "s2a" "s1dummy" "s1exec" "s1c" "s1exec"
+permutation "s1brr" "s1prep" "s1exec" "s2a" "s1dummy2" "s1exec" "s1c" "s1exec"
+permutation "s1brr" "s1prep" "s1exec" "s2a" "s1ins" "s1exec" "s1c" "s1exec"
+
+# no transaction
+permutation "s1prep" "s1exec" "s2a" "s1exec"
+permutation "s1prep" "s1exec" "s2a" "s1dummy" "s1exec"
+permutation "s1prep" "s1exec" "s2a" "s1dummy2" "s1exec"
+permutation "s1prep" "s1exec" "s2a" "s1ins" "s1exec"
diff --git a/src/test/isolation/specs/detach-partition-1.spec b/src/test/isolation/specs/detach-partition-1.spec
new file mode 100644
index 0000000000..8f18853948
--- /dev/null
+++ b/src/test/isolation/specs/detach-partition-1.spec
@@ -0,0 +1,31 @@
+# Test that detach partition concurrently makes the partition invisible at the
+# correct time.
+
+setup
+{
+ CREATE TABLE d_listp (a int) PARTITION BY LIST(a);
+ CREATE TABLE d_listp1 PARTITION OF d_listp FOR VALUES IN (1);
+ CREATE TABLE d_listp2 PARTITION OF d_listp FOR VALUES IN (2);
+ INSERT INTO d_listp VALUES (1),(2);
+}
+
+teardown { DROP TABLE IF EXISTS d_listp, d_listp2; }
+
+session "s1"
+step "s1brr" { BEGIN ISOLATION LEVEL REPEATABLE READ; }
+step "s1brc" { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1s" { SELECT * FROM d_listp; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2d" { ALTER TABLE d_listp DETACH PARTITION d_listp2; }
+step "s2drop" { DROP TABLE d_listp2; }
+
+# In repeatable-read isolation level, d_listp2's row should always be visible to
+# s1 until transaction commit. Also, s2 cannot drop the detached partition
+# until s1 has closed its transaction.
+permutation "s1brr" "s1s" "s2d" "s1s" "s2drop" "s1c" "s1s"
+
+# In read-committed isolation level, the partition "disappears" immediately
+# from view. However, the DROP still has to wait for s1's commit.
+permutation "s1brc" "s1s" "s2d" "s1s" "s2drop" "s1c" "s1s"
diff --git a/src/test/isolation/specs/detach-partition-2.spec b/src/test/isolation/specs/detach-partition-2.spec
new file mode 100644
index 0000000000..24035276a8
--- /dev/null
+++ b/src/test/isolation/specs/detach-partition-2.spec
@@ -0,0 +1,32 @@
+# Test that detach partition concurrently makes the partition invisible at the
+# correct time.
+
+setup
+{
+ CREATE TABLE d_listp (a int) PARTITION BY LIST(a);
+ CREATE TABLE d_listp1 PARTITION OF d_listp FOR VALUES IN (1);
+ CREATE TABLE d_listp2 PARTITION OF d_listp FOR VALUES IN (2);
+ INSERT INTO d_listp VALUES (1),(2);
+}
+
+teardown { DROP TABLE IF EXISTS d_listp, d_listp2; }
+
+session "s1"
+step "s1brr" { BEGIN ISOLATION LEVEL REPEATABLE READ; }
+step "s1brc" { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1dec" { DECLARE f NO SCROLL CURSOR FOR SELECT * FROM d_listp; }
+step "s1fetch" { FETCH ALL FROM f; MOVE ABSOLUTE 0 f; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2d" { ALTER TABLE d_listp DETACH PARTITION d_listp2; }
+step "s2drop" { DROP TABLE d_listp2; }
+
+# In repeatable-read isolation level, d_listp2's row should always be visible to
+# s1 until transaction commit. Also, s2 cannot drop the detached partition
+# until s1 has closed its transaction.
+permutation "s1brr" "s1dec" "s1fetch" "s2d" "s1fetch" "s2drop" "s1c"
+
+# In read-committed isolation level, the partition "disappears" immediately
+# from view. However, the DROP still has to wait for s1's commit.
+permutation "s1brc" "s1dec" "s1fetch" "s2d" "s1fetch" "s2drop" "s1c"
diff --git a/src/test/isolation/specs/detach-partition-3.spec b/src/test/isolation/specs/detach-partition-3.spec
new file mode 100644
index 0000000000..5410f92d31
--- /dev/null
+++ b/src/test/isolation/specs/detach-partition-3.spec
@@ -0,0 +1,33 @@
+# Test that detach partition concurrently makes the partition invisible at the
+# correct time.
+
+setup
+{
+ CREATE TABLE dp_listp (a int) PARTITION BY LIST(a);
+ CREATE TABLE dp_listp1 PARTITION OF dp_listp FOR VALUES IN (1);
+ CREATE TABLE dp_listp2 PARTITION OF dp_listp FOR VALUES IN (2);
+ INSERT INTO dp_listp VALUES (1),(2);
+}
+
+teardown { DROP TABLE IF EXISTS dp_listp, dp_listp2; }
+
+session "s1"
+step "s1brr" { BEGIN ISOLATION LEVEL REPEATABLE READ; }
+step "s1brc" { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1prep" { PREPARE f AS SELECT * FROM dp_listp; }
+step "s1exec" { EXECUTE f; }
+step "s1c" { COMMIT; }
+teardown { DEALLOCATE f; }
+
+session "s2"
+step "s2d" { ALTER TABLE dp_listp DETACH PARTITION dp_listp2; }
+step "s2drop" { DROP TABLE dp_listp2; }
+
+# In repeatable-read isolation level, dp_listp2's row should always be visible to
+# s1 until transaction commit. Also, s2 cannot drop the detached partition
+# until s1 has closed its transaction.
+permutation "s1brr" "s1prep" "s1exec" "s2d" "s1exec" "s2drop" "s1c" "s1exec"
+
+# In read-committed isolation level, the partition "disappears" immediately
+# from view. However, the DROP still has to wait for s1's commit.
+permutation "s1brc" "s1prep" "s1exec" "s2d" "s1exec" "s2drop" "s1c" "s1exec"
On Thu, Oct 25, 2018 at 4:26 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Firstly, I took Robert's advice and removed the CONCURRENTLY keyword
> from the syntax. We just do it that way always. When there's a default
> partition, only that partition is locked with an AEL; all the rest is
> locked with ShareUpdateExclusive only.
Check.
> Then there are a few other implementation details worth mentioning:
> 3. parallel query: when a worker starts on a snapshot that has a
> partition descriptor cache, we need to transmit those partdescs from
> leader via shmem ... but we cannot send the full struct, so we just send
> the OID list of partitions, then rebuild the descriptor in the worker.
> Side effect: if a partition is detached right between the leader taking
> the partdesc and the worker starting, the partition loses its
> relpartbound column, so it's not possible to reconstruct the partdesc.
> In this case, we raise an error. Hopefully this should be rare.
I don't think it's a good idea to for parallel query to just randomly
fail in cases where a non-parallel query would have worked. I tried
pretty hard to avoid that while working on the feature, and it would
be a shame to see that work undone.
It strikes me that it would be a good idea to break this work into two
phases. In phase 1, let's support ATTACH and CREATE TABLE ..
PARTITION OF without requiring AccessExclusiveLock. In phase 2, think
about concurrency for DETACH (and possibly DROP).
I suspect phase 1 actually isn't that hard. It seems to me that the
only thing we REALLY need to ensure is that the executor doesn't blow
up if a relcache reload occurs. There are probably a few different
approaches to that problem, but I think it basically boils down to (1)
making sure that the executor is holding onto pointers to the exact
objects it wants to use and not re-finding them through the relcache
and (2) making sure that the relcache doesn't free and rebuild those
objects but rather holds onto the existing copies. With this
approach, already-running queries won't take into account the fact
that new partitions have been added, but that seems at least tolerable
and perhaps desirable.
For phase 2, we're not just talking about adding stuff that need not
be used immediately, but about removing stuff which may already be in
use. Your email doesn't seem to describe what we want the *behavior*
to be in that case. Leave aside for a moment the issue of not
crashing: what are the desired semantics? I think it would be pretty
strange if you had a COPY running targeting a partitioned table,
detached a partition, and the COPY continued to route tuples to the
detached partition even though it was now an independent table. It
also seems pretty strange if the tuples just get thrown away. If the
COPY isn't trying to send any tuples to the now-detached partition,
then it's fine, but if it is, then I have trouble seeing any behavior
other than an error as sane, unless perhaps a new partition has been
attached or created for that part of the key space.
If you adopt that proposal, then the problem of parallel query
behaving differently from non-parallel query goes away. You just get
an error in both cases, probably to the effect that there is no
longer a partition matching the tuple you are trying to insert (or
update).
If you're not hacking on this patch set too actively right at the
moment, I'd like to spend some time hacking on the CREATE/ATTACH side
of things and see if I can produce something committable for that
portion of the problem.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2018-Nov-06, Robert Haas wrote:
If you're not hacking on this patch set too actively right at the
moment, I'd like to spend some time hacking on the CREATE/ATTACH side
of things and see if I can produce something committable for that
portion of the problem.
I'm not -- feel free to hack away.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, 6 Nov 2018 at 10:10, Robert Haas <robertmhaas@gmail.com> wrote:
With this
approach, already-running queries won't take into account the fact
that new partitions have been added, but that seems at least tolerable
and perhaps desirable.
Desirable, imho. No data added after a query starts would be visible.
If the
COPY isn't trying to send any tuples to the now-detached partition,
then it's fine, but if it is, then I have trouble seeing any behavior
other than an error as sane, unless perhaps a new partition has been
attached or created for that part of the key space.
Error in the COPY or in the DDL? COPY preferred. Somebody with insert
rights shouldn't be able to prevent a table-owner level action. People
normally drop partitions to save space, so it could be annoying if that was
interrupted.
Supporting parallel query shouldn't make other cases more difficult from a
behavioral perspective just to avoid the ERROR. The ERROR sounds annoying,
but not sure how annoying avoiding it would be.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Nov 6, 2018 at 1:54 PM Simon Riggs <simon@2ndquadrant.com> wrote:
Error in the COPY or in the DDL? COPY preferred. Somebody with insert rights shouldn't be able to prevent a table-owner level action. People normally drop partitions to save space, so it could be annoying if that was interrupted.
Yeah, the COPY.
Supporting parallel query shouldn't make other cases more difficult from a behavioral perspective just to avoid the ERROR. The ERROR sounds annoying, but not sure how annoying avoiding it would be.
In my view, it's not just a question of it being annoying, but of
whether anything else is even sensible. I mean, you can avoid an
error when a user types SELECT 1/0 by returning NULL or 42, but that's
not usually how we roll around here.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, 6 Nov 2018 at 10:56, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Nov 6, 2018 at 1:54 PM Simon Riggs <simon@2ndquadrant.com> wrote:
Error in the COPY or in the DDL? COPY preferred. Somebody with insert
rights shouldn't be able to prevent a table-owner level action. People
normally drop partitions to save space, so it could be annoying if that was
interrupted.
Yeah, the COPY.
Supporting parallel query shouldn't make other cases more difficult from
a behavioral perspective just to avoid the ERROR. The ERROR sounds
annoying, but not sure how annoying avoiding it would be.
In my view, it's not just a question of it being annoying, but of
whether anything else is even sensible. I mean, you can avoid an
error when a user types SELECT 1/0 by returning NULL or 42, but that's
not usually how we roll around here.
If you can remove the ERROR without any other adverse effects, that sounds
great.
Please let us know what, if any, adverse effects would be caused so we can
discuss. Thanks
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Nov 6, 2018 at 2:01 PM Simon Riggs <simon@2ndquadrant.com> wrote:
If you can remove the ERROR without any other adverse effects, that sounds great.
Please let us know what, if any, adverse effects would be caused so we can discuss. Thanks
Well, I've already written about this in two previous emails on this
thread, so I'm not sure exactly what you think is missing. But to
state the problem again:
If you don't throw an error when a partition is concurrently detached
and then someone routes a tuple to that portion of the key space, what
DO you do? Continue inserting tuples into the table even though it's
no longer a partition? Throw tuples destined for that partition away?
You can make an argument for both of those behaviors, but they're
both pretty strange. The first one means that for an arbitrarily long
period of time after detaching a partition, the partition may continue
to receive inserts that were destined for its former parent. The
second one means that your data can disappear into the ether. I don't
like either of those things.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2018-Nov-06, Robert Haas wrote:
If you don't throw an error when a partition is concurrently detached
and then someone routes a tuple to that portion of the key space, what
DO you do? Continue inserting tuples into the table even though it's
no longer a partition?
Yes -- the table was a partition when the query started, so it's still
a partition from the point of view of that query's snapshot.
Throw tuples destined for that partition away?
Surely not. (/me doesn't beat straw men anyway.)
You can make an argument for both of those behaviors, but they're
both pretty strange. The first one means that for an arbitrarily long
period of time after detaching a partition, the partition may continue
to receive inserts that were destined for its former parent.
Not arbitrarily long -- only as long as those old snapshots live. I
don't find this at all surprising.
(I think DETACH is not related to DROP in any way. My proposal is that
DETACH can work concurrently, and if people want to drop the partition
later they can wait until snapshots/queries that could see that
partition are gone.)
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Nov 6, 2018 at 2:10 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2018-Nov-06, Robert Haas wrote:
If you don't throw an error when a partition is concurrently detached
and then someone routes a tuple to that portion of the key space, what
DO you do? Continue inserting tuples into the table even though it's
no longer a partition?
Yes -- the table was a partition when the query started, so it's still
a partition from the point of view of that query's snapshot.
I think it's important to point out that DDL does not in general
respect the query snapshot. For example, you can query a table that
was created by a transaction not visible to your query snapshot. You
cannot query a table that was dropped by a transaction not visible to
your query snapshot. If someone runs ALTER FUNCTION on a function
your query uses, you get the latest committed version, not the version
that was current at the time your query snapshot was created. So, if
we go with the semantics you are proposing here, we will be making
this DDL behave differently from pretty much all other DDL.
Possibly that's OK in this case, but it's easy to think of other cases
where it could cause problems. To take an example that I believe was
discussed on-list a number of years ago, suppose that ADD CONSTRAINT
worked according to the model that you are proposing for ATTACH
PARTITION. If it did, then one transaction could be concurrently
inserting a tuple while another transaction was adding a constraint
which the tuple fails to satisfy. Once both transactions commit, you
have a table with a supposedly-valid constraint and a tuple inside of
it that doesn't satisfy that constraint. Obviously, that's no good.
I'm not entirely sure whether there are any similar dangers in the
case of DETACH PARTITION. I think it depends a lot on what can be
done with that detached partition while the overlapping transaction is
still active. For instance, suppose you attached it to the original
table with a different set of partition bounds, or attached it to some
other table with a different set of partition bounds. If you can do
that, then I think it effectively creates the problem described in the
previous paragraph with respect to the partition constraint.
IOW, we've got to somehow prevent this:
setup: partition is attached with bounds 1 to a million
S1: COPY begins
S2: partition is detached
S2: partition is reattached with bounds 1 to a thousand
S1: still-running copy inserts a tuple with value ten thousand
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, 6 Nov 2018 at 11:06, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Nov 6, 2018 at 2:01 PM Simon Riggs <simon@2ndquadrant.com> wrote:
If you can remove the ERROR without any other adverse effects, that
sounds great.
Please let us know what, if any, adverse effects would be caused so we
can discuss. Thanks
Well, I've already written about this in two previous emails on this
thread, so I'm not sure exactly what you think is missing. But to
state the problem again:
I was discussing the ERROR in relation to parallel query, not COPY.
I didn't understand how that would be achieved.
Thanks for working on this.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Aug 7, 2018 at 9:29 AM Andres Freund <andres@anarazel.de> wrote:
One approach would be to make sure that everything relying on
rt_partdesc staying the same stores its value in a local variable, and
then *not* free the old version of rt_partdesc (etc) when the refcount >
0, but delay that to the RelationClose() that makes refcount reach
0. That'd be the start of a framework for more such concurrent
handling.
Some analysis of possible trouble spots:
- get_partition_dispatch_recurse and ExecCreatePartitionPruneState
both call RelationGetPartitionDesc. Presumably, this means that if
the partition descriptor gets updated on the fly, the tuple routing
and partition dispatch code could end up with different ideas about
which partitions exist. I think this should be fixed somehow, so that
we only call RelationGetPartitionDesc once per query and use the
result for everything.
- expand_inherited_rtentry checks
RelationGetPartitionDesc(oldrelation) != NULL. If so, it calls
expand_partitioned_rtentry which fetches the same PartitionDesc again.
We can probably just do this once in the caller and pass the result
down.
- set_relation_partition_info also calls RelationGetPartitionDesc.
Off-hand, I think this code runs after expand_inherited_rtentry. Not
sure what to do about this. I'm not sure what the consequences would
be if this function and that one had different ideas about the
partition descriptor.
- tablecmds.c is pretty free about calling RelationGetPartitionDesc
repeatedly, but it probably doesn't matter. If we're doing some kind
of DDL that depends on the contents of the partition descriptor, we
*had better* be holding a lock strong enough to prevent the partition
descriptor from being changed by somebody else at the same time.
Allowing a partition to be added concurrently with DML is one thing;
allowing a partition to be added concurrently with adding another
partition is a whole different level of insanity. I think we'd be
best advised not to go down that rathole - among other concerns, how
would you even guarantee that the partitions being added didn't
overlap?
Generally:
Is it really OK to postpone freeing the old partition descriptor until
the relation reference count goes to 0? I wonder if there are cases
where this could lead to tons of copies of the partition descriptor
floating around at the same time, gobbling up precious cache memory.
My first thought was that this would be pretty easy: just create a lot
of new partitions one by one while some long-running transaction is
open. But the actual result in that case depends on the behavior of
the backend running the transaction. If it just ignores the new
partitions and sticks with the partition descriptor it has got, then
probably nothing else will request the new partition descriptor either
and there will be no accumulation of memory. However, if it tries to
absorb the updated partition descriptor, but without being certain
that the old one can be freed, then we'd have a query-lifespan memory
leak which is quadratic in the number of new partitions.
Maybe even that would be OK -- we could suppose that the number of new
partitions probably wouldn't be all THAT crazy large, and the constant
factor not too bad, so maybe you'd leak a couple of MB for the length
of the query, but no more. However, I wonder if it would be better to
give each PartitionDescData its own refcnt, so that it can be freed
immediately when the refcnt goes to zero. That would oblige every
caller of RelationGetPartitionDesc() to later call something like
ReleasePartitionDesc(). We could catch failures to do that by keeping
all the PartitionDesc objects so far created in a list. When the main
entry's refcnt goes to 0, cross-check that this list is empty; if not,
then the remaining entries have non-zero refcnts that were leaked. We
could emit a WARNING as we do in similar cases.
In general, I think something along the lines you are suggesting here
is the right place to start attacking this problem. Effectively, we
immunize the system against the possibility of new entries showing up
in the partition descriptor while concurrent DML is running; the
semantics are that the new partitions are ignored for the duration of
currently-running queries. This seems to allow for painless creation
or addition of new partitions in normal cases, but not when a default
partition exists. In that case, using the old PartitionDesc is
outright wrong, because adding a new toplevel partition changes the
default partition's partition constraint. We can't insert into the
default partition a tuple that under the updated table definition
needs to go someplace else. It seems like the best way to account for
that is to reduce the lock level on the partitioned table to
ShareUpdateExclusiveLock, but leave the lock level on any default
partition as AccessExclusiveLock (because we are modifying a
constraint on it). We would also need to leave the lock level on the
new partition as AccessExclusiveLock (because we are adding a
constraint on it). Not perfect, for sure, but not bad for a first
patch, either; it would improve things for users in a bunch of
practical cases.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2018-Nov-06, Alvaro Herrera wrote:
On 2018-Nov-06, Robert Haas wrote:
Throw tuples destined for that partition away?
Surely not. (/me doesn't beat straw men anyway.)
Hmm, apparently this can indeed happen with my patch :-(
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Nov 6, 2018 at 10:18 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Throw tuples destined for that partition away?
Surely not. (/me doesn't beat straw men anyway.)
Hmm, apparently this can indeed happen with my patch :-(
D'oh. This is a hard problem, especially the part of it that involves
handling detach, so I wouldn't feel too bad about that. However, to
beat this possibly-dead horse a little more, I think you made the
error of writing a patch that (1) tried to solve too many problems at
once and (2) didn't seem to really have a clear, well-considered idea
about what the semantics ought to be.
This is not intended as an attack; I want to work with you to solve
the problem, not have a fight about it.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Nov 6, 2018 at 5:09 PM Robert Haas <robertmhaas@gmail.com> wrote:
- get_partition_dispatch_recurse and ExecCreatePartitionPruneState
both call RelationGetPartitionDesc. Presumably, this means that if
the partition descriptor gets updated on the fly, the tuple routing
and partition dispatch code could end up with different ideas about
which partitions exist. I think this should be fixed somehow, so that
we only call RelationGetPartitionDesc once per query and use the
result for everything.
I think there is deeper trouble here.
ExecSetupPartitionTupleRouting() calls find_all_inheritors() to
acquire RowExclusiveLock on the whole partitioning hierarchy. It then
calls RelationGetPartitionDispatchInfo (as a non-relcache function,
this seems poorly named) which calls get_partition_dispatch_recurse,
which does this:
/*
* We assume all tables in the partition tree were already locked
* by the caller.
*/
Relation partrel = heap_open(partrelid, NoLock);
That seems OK at present, because no new partitions can have appeared
since ExecSetupPartitionTupleRouting() acquired locks. But if we
allow new partitions to be added with only ShareUpdateExclusiveLock,
then I think there would be a problem. If a new partition OID creeps
into the partition descriptor after find_all_inheritors() and before
we fetch its partition descriptor, then we wouldn't have previously
taken a lock on it and would still be attempting to open it without a
lock, which is bad (cf. b04aeb0a053e7cf7faad89f7d47844d8ba0dc839).
Admittedly, it might be a bit hard to provoke a failure here because
I'm not exactly sure how you could trigger a relcache reload in the
critical window, but I don't think we should rely on that.
More generally, it seems a bit strange that we take the approach of
locking the entire partitioning hierarchy here regardless of which
relations the query actually knows about. If some relations have been
pruned, presumably we don't need to lock them; if/when we permit
concurrent partition attach, we don't need to lock any new ones that have
materialized. We're just going to end up ignoring them anyway because
there's nothing to do with the information that they are or are not
excluded from the query when they don't appear in the query plan in
the first place.
Furthermore, this whole thing looks suspiciously like more of the sort
of redundant locking that f2343653f5b2aecfc759f36dbb3fd2a61f36853e
attempted to eliminate. In light of that commit message, I'm
wondering whether the best approach would be to [1] get rid of the
find_all_inheritors call altogether and [2] somehow ensure that
get_partition_dispatch_recurse() doesn't open any tables that aren't
part of the query's range table.
Thoughts? Comments? Ideas?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2018-Nov-06, Robert Haas wrote:
- get_partition_dispatch_recurse and ExecCreatePartitionPruneState
both call RelationGetPartitionDesc.
My patch deals with this by caching descriptors in the active snapshot.
So those two things would get the same partition descriptor. There's no
RelationGetPartitionDesc anymore, and SnapshotGetPartitionDesc takes its
place.
(I tried to use different scoping than the active snapshot; I first
tried the Portal, then I tried the resource owner. But nothing seems to
fit as precisely as the active snapshot.)
- expand_inherited_rtentry checks
RelationGetPartitionDesc(oldrelation) != NULL. If so, it calls
expand_partitioned_rtentry which fetches the same PartitionDesc again.
This can be solved by changing the test to a relkind one, as my patch
does.
- set_relation_partition_info also calls RelationGetPartitionDesc.
Off-hand, I think this code runs after expand_inherited_rtentry. Not
sure what to do about this. I'm not sure what the consequences would
be if this function and that one had different ideas about the
partition descriptor.
Snapshot caching, like in my patch, again solves this problem.
- tablecmds.c is pretty free about calling RelationGetPartitionDesc
repeatedly, but it probably doesn't matter. If we're doing some kind
of DDL that depends on the contents of the partition descriptor, we
*had better* be holding a lock strong enough to prevent the partition
descriptor from being changed by somebody else at the same time.
My patch deals with this by unlinking the partcache entry from the hash
table on relation invalidation, so DDL code would obtain a fresh copy
each time (lookup_partcache_entry).
In other words, I already solved these problems you list.
Maybe you could give my patch a look.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Nov 7, 2018 at 12:58 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2018-Nov-06, Robert Haas wrote:
- get_partition_dispatch_recurse and ExecCreatePartitionPruneState
both call RelationGetPartitionDesc.
My patch deals with this by caching descriptors in the active snapshot.
So those two things would get the same partition descriptor. There's no
RelationGetPartitionDesc anymore, and SnapshotGetPartitionDesc takes its
place.
(I tried to use different scoping than the active snapshot; I first
tried the Portal, then I tried the resource owner. But nothing seems to
fit as precisely as the active snapshot.)
...
In other words, I already solved these problems you list.
Maybe you could give my patch a look.
I have, a bit. One problem I'm having is that while you explained the
design you chose in a fair amount of detail, you didn't give a lot of
explanation (that I have seen) of the reasons why you chose that
design. If there's a README or a particularly good comment someplace
that I should be reading to understand that better, please point me in
the right direction.
And also, I just don't really understand what all the problems are
yet. I'm only starting to study this.
I am a bit skeptical of your approach, though. Tying it to the active
snapshot seems like an awfully big hammer. Snapshot manipulation can
be a performance bottleneck both in terms of actual performance and
also in terms of code complexity, and I don't really like the idea of
adding more code there. It's not a sustainable pattern for making DDL
work concurrently, either -- I'm pretty sure we don't want to add new
code to things like GetLatestSnapshot() every time we want to make a
new kind of DDL concurrent. Also, while hash table lookups are pretty
cheap, they're not free. In my opinion, to the extent that we can, it
would be better to refactor things to avoid duplicate lookups of the
PartitionDesc rather than to install a new subsystem that tries to
make sure they always return the same answer.
Such an approach seems to have other possible advantages. For
example, if a COPY is running and a new partition shows up, we might
actually want to allow tuples to be routed to it. Maybe that's too
pie in the sky, but if we want to preserve the option to do such
things in the future, a hard-and-fast rule that the apparent partition
descriptor doesn't change unless the snapshot changes seems like it
might get in the way. It seems better to me to have a system where
code that accesses the relcache has a choice, so that at its option it
can either hang on to the PartitionDesc it has or get a new one that
may be different. If we can do things that way, it gives us the most
flexibility.
After the poking around I've done over the last 24 hours, I do see
that there are some non-trivial problems with making it that way, but
I'm not really ready to give up yet.
Does that make sense to you, or am I all wet here?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 8 November 2018 at 05:05, Robert Haas <robertmhaas@gmail.com> wrote:
That seems OK at present, because no new partitions can have appeared
since ExecSetupPartitionTupleRouting() acquired locks. But if we
allow new partitions to be added with only ShareUpdateExclusiveLock,
then I think there would be a problem. If a new partition OID creeps
into the partition descriptor after find_all_inheritors() and before
we fetch its partition descriptor, then we wouldn't have previously
taken a lock on it and would still be attempting to open it without a
lock, which is bad (cf. b04aeb0a053e7cf7faad89f7d47844d8ba0dc839).
Admittedly, it might be a bit hard to provoke a failure here because
I'm not exactly sure how you could trigger a relcache reload in the
critical window, but I don't think we should rely on that.
More generally, it seems a bit strange that we take the approach of
locking the entire partitioning hierarchy here regardless of which
relations the query actually knows about. If some relations have been
pruned, presumably we don't need to lock them; if/when we permit
concurrent partition attach, we don't need to lock any new ones that have
materialized. We're just going to end up ignoring them anyway because
there's nothing to do with the information that they are or are not
excluded from the query when they don't appear in the query plan in
the first place.
While the find_all_inheritors() call is something I'd like to see
gone, I assume it was done that way since an UPDATE might route a
tuple to a partition that there is no subplan for and due to INSERT
with VALUES not having any RangeTblEntry for any of the partitions.
Simply, any partition which is a descendant of the target partition
table could receive the tuple regardless of what might have been
pruned.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Nov 7, 2018 at 7:06 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
While the find_all_inheritors() call is something I'd like to see
gone, I assume it was done that way since an UPDATE might route a
tuple to a partition that there is no subplan for and due to INSERT
with VALUES not having any RangeTblEntry for any of the partitions.
Simply, any partition which is a descendant of the target partition
table could receive the tuple regardless of what might have been
pruned.
Thanks. I had figured out since my email of earlier today that it was
needed in the INSERT case, but I had not thought of/discovered the
case of an UPDATE that routes a tuple to a pruned partition. I think
that latter case may not be tested in our regression tests, which is
perhaps something we ought to change.
Honestly, I *think* that the reason that find_all_inheritors() call is
there is because I had the idea that it was important to try to lock
partition hierarchies in the same order in all cases so as to avoid
spurious deadlocks. However, I don't think we're really achieving
that goal despite this code. If we arrive at this point having
already locked some relations, and then lock some more, based on
whatever got pruned, we're clearly not using a deterministic locking
order. So I think we could probably rip out the find_all_inheritors()
call here and change the NoLock in get_partition_dispatch_recurse() to
just take a lock. That's probably a worthwhile simplification and a
slight optimization regardless of anything else.
But I really think it would be better if we could also jigger this to
avoid reopening relations which the executor has already opened and
locked elsewhere. Unfortunately, I don't see a really simple way to
accomplish that. We get the OIDs of the descendents and want to know
whether there is range table entry for that OID; but there's no data
structure which answers that question at present, I believe, and
introducing one just for this purpose seems like an awful lot of new
machinery. Perhaps that new machinery would still have less
far-reaching consequences than the machinery Alvaro is proposing, but,
still, it's not very appealing.
Perhaps one idea is only open and lock partitions on demand - i.e. if
a tuple actually gets routed to them. There are good reasons to do
that independently of reducing lock levels, and we certainly couldn't
do it without having some efficient way to check whether it had
already been done. So then the mechanism wouldn't feel like so much
like a special-purpose hack just for concurrent ATTACH/DETACH. (Was
Amit Langote already working on this, or was that some other kind of
on-demand locking?)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 8 November 2018 at 15:01, Robert Haas <robertmhaas@gmail.com> wrote:
Honestly, I *think* that the reason that find_all_inheritors() call is
there is because I had the idea that it was important to try to lock
partition hierarchies in the same order in all cases so as to avoid
spurious deadlocks. However, I don't think we're really achieving
that goal despite this code. If we arrive at this point having
already locked some relations, and then lock some more, based on
whatever got pruned, we're clearly not using a deterministic locking
order. So I think we could probably rip out the find_all_inheritors()
call here and change the NoLock in get_partition_dispatch_recurse() to
just take a lock. That's probably a worthwhile simplification and a
slight optimization regardless of anything else.
I'd not thought of the locks taken elsewhere case. I guess it just
perhaps reduces the chances of a deadlock then.
A "slight optimization" is one way to categorise it. There are some
benchmarks you might find interesting in [1] and [2]. Patch 0002 does
just what you mention.
[1]: /messages/by-id/06524959-fda8-cff9-6151-728901897b79@redhat.com
[2]: /messages/by-id/CAKJS1f_1RJyFquuCKRFHTdcXqoPX-PYqAd7nz=GVBwvGh4a6xA@mail.gmail.com
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2018/11/08 11:01, Robert Haas wrote:
On Wed, Nov 7, 2018 at 7:06 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
While the find_all_inheritors() call is something I'd like to see
gone, I assume it was done that way since an UPDATE might route a
tuple to a partition that there is no subplan for and due to INSERT
with VALUES not having any RangeTblEntry for any of the partitions.
Simply, any partition which is a descendant of the target partition
table could receive the tuple regardless of what might have been
pruned.
Thanks. I had figured out since my email of earlier today that it was
needed in the INSERT case, but I had not thought of/discovered the
case of an UPDATE that routes a tuple to a pruned partition. I think
that latter case may not be tested in our regression tests, which is
perhaps something we ought to change.
Honestly, I *think* that the reason that find_all_inheritors() call is
there is because I had the idea that it was important to try to lock
partition hierarchies in the same order in all cases so as to avoid
spurious deadlocks. However, I don't think we're really achieving
that goal despite this code. If we arrive at this point having
already locked some relations, and then lock some more, based on
whatever got pruned, we're clearly not using a deterministic locking
order. So I think we could probably rip out the find_all_inheritors()
call here and change the NoLock in get_partition_dispatch_recurse() to
just take a lock. That's probably a worthwhile simplification and a
slight optimization regardless of anything else.
A patch that David and I have been working on over at:
https://commitfest.postgresql.org/20/1690/
does that. With that patch, partitions (leaf or not) are locked and
opened only if a tuple is routed to them. In edd44738bc (Be lazier about
partition tuple routing), we postponed the opening of leaf partitions, but
we still left the RelationGetPartitionDispatchInfo machinery, which
recursively creates PartitionDispatch structs for all partitioned tables
in a tree. The patch mentioned above postpones even the initialization
of partitioned partitions to a point after a tuple is routed to them.
The patch doesn't yet eliminate the find_all_inheritors call from
ExecSetupPartitionTupleRouting. But that's mostly because of the fear
that if we start becoming lazier about locking individual partitions too,
we'll get non-deterministic locking order on partitions that we might want
to avoid for deadlock fears. Maybe we don't need to be fearful though.
But I really think it would be better if we could also jigger this to
avoid reopening relations which the executor has already opened and
locked elsewhere. Unfortunately, I don't see a really simple way to
accomplish that. We get the OIDs of the descendants and want to know
whether there is a range table entry for that OID; but there's no data
structure which answers that question at present, I believe, and
introducing one just for this purpose seems like an awful lot of new
machinery. Perhaps that new machinery would still have less
far-reaching consequences than the machinery Alvaro is proposing, but,
still, it's not very appealing.
The newly added ExecGetRangeTableRelation opens (if not already done) and
returns a Relation pointer for tables that are present in the range table,
so it must be passed a valid RT index. That works for tables that the
planner touched. UPDATE tuple routing benefits from that in cases where
the routing target is already in the range table.
For the insert itself, the planner adds only the target partitioned
table to the range table. Partitions that the inserted tuples will
route to may be present in the range table via some other plan node,
but the insert's execution state won't know about them, so it cannot
use ExecGetRangeTableRelation.
Perhaps one idea is only open and lock partitions on demand - i.e. if
a tuple actually gets routed to them. There are good reasons to do
that independently of reducing lock levels, and we certainly couldn't
do it without having some efficient way to check whether it had
already been done. So then the mechanism wouldn't feel like so much
like a special-purpose hack just for concurrent ATTACH/DETACH. (Was
Amit Langote already working on this, or was that some other kind of
on-demand locking?)
I think the patch mentioned above gets us closer to that goal.
Thanks,
Amit
On Wed, Nov 7, 2018 at 1:37 PM Robert Haas <robertmhaas@gmail.com> wrote:
Maybe you could give my patch a look.
I have, a bit.
While thinking about this problem a bit more, I realized that what is
called RelationBuildPartitionDesc in master and BuildPartitionDesc in
Alvaro's patch has a synchronization problem as soon as we start to
reduce lock levels. At some point, find_inheritance_children() gets
called to get a list of the OIDs of the partitions. Then, later,
SysCacheGetAttr(RELOID, ...) gets called for each one to get its
relpartbound value. But since catalog lookups use the most current
snapshot, they might not see a compatible view of the catalogs.
That could manifest in a few different ways:
- We might see a newer version of relpartbound, where it's now null
because it's been detached.
- We might see a newer version of relpartbound where it now has an
unrelated value because it has been detached and then reattached to
some other partitioned table.
- We might see newer versions of relpartbound for some tables than
others. For instance, suppose we had partition A for 1..200 and B for
201..300. Then we realize that this is not what we actually wanted to
do, so we detach A and reattach it with a bound of 1..100 and detach
B and reattach it with a bound of 101..300. If we perform the
syscache lookup for A before this happens and the syscache lookup for
B after this happens, we might see the old bound for A and the new
bound for B, and that would be sad, 'cuz they overlap.
- Seeing an older relpartbound for some other table is also a problem
for other reasons -- we will have the wrong idea about the bounds of
that partition and may put the wrong tuples into it. Without
AccessExclusiveLock, I don't think there is anything that keeps us
from reading stale syscache entries.
Alvaro's patch defends against the first of these cases by throwing an
error, which, as I already said, I don't think is acceptable, but I
don't see any defense at all against the other cases. The root of the
problem is the way catalog lookups work today: each individual
lookup uses the latest available snapshot, but there is zero guarantee
that consecutive lookups use the same snapshot. Therefore, as soon as
you start lowering lock levels, you are at risk for inconsistent data.
I suspect the only good way of fixing this problem is using a single
snapshot to perform both the scan of pg_inherits and the subsequent
pg_class lookups. That way, you know that you are seeing the state of
the whole partitioning hierarchy as it existed at some particular
point in time -- every commit is either fully reflected in the
constructed PartitionDesc or not reflected at all. Unfortunately,
that would mean that we can't use the syscache to perform the lookups,
which might have unhappy performance consequences.
Note that this problem still exists even if you allow concurrent
attach but not concurrent detach, but it's not as bad, because when
you encounter a concurrently-attached partition, you know it hasn't
also been concurrently-detached from someplace else. Presumably you
either see the latest value of the partition bound or the NULL value
which preceded it, but not anything else. If that's so, then maybe
you could get by without using a consistent snapshot for all of your
information gathering: if you see NULL, you know that the partition
was concurrently added and you just ignore it. There's still no
guarantee that all parallel workers would come to the same conclusion,
though, which doesn't feel too good.
Personally, I don't think it's right to blame that problem on parallel
query. The problem is more general than that: we assume that holding
any kind of a lock on a relation is enough to keep the important
details of the relation static, and therefore it's fine to do
staggered lookups within one backend, and it's also fine to do
staggered lookups across different backends. When you remove the
basic assumption that any lock is enough to prevent concurrent DDL,
then the whole idea that you can do different lookups at different
times with different snapshots (possibly in different backends) and
get sane answers also ceases to be correct. But the idea that you can
look up different bits of catalog data at whatever time is convenient
undergirds large amounts of our current machinery -- it's built into
relcache, syscache, sinval, ...
I think that things get even crazier if we postpone locking on
individual partitions until we need to do something with that
partition, as has been proposed elsewhere.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 9 November 2018 at 05:34, Robert Haas <robertmhaas@gmail.com> wrote:
I suspect the only good way of fixing this problem is using a single
snapshot to perform both the scan of pg_inherits and the subsequent
pg_class lookups. That way, you know that you are seeing the state of
the whole partitioning hierarchy as it existed at some particular
point in time -- every commit is either fully reflected in the
constructed PartitionDesc or not reflected at all. Unfortunately,
that would mean that we can't use the syscache to perform the lookups,
which might have unhappy performance consequences.
I do have a patch sitting around that moves the relpartbound into a
new catalogue table named pg_partition. This gets rid of the usage of
pg_inherits for partitioned tables. I wonder if that problem is easier
to solve with that. It also solves the issue with long partition keys
and lack of toast table on pg_class.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Nov 8, 2018 at 3:59 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
On 9 November 2018 at 05:34, Robert Haas <robertmhaas@gmail.com> wrote:
I suspect the only good way of fixing this problem is using a single
snapshot to perform both the scan of pg_inherits and the subsequent
pg_class lookups. That way, you know that you are seeing the state of
the whole partitioning hierarchy as it existed at some particular
point in time -- every commit is either fully reflected in the
constructed PartitionDesc or not reflected at all. Unfortunately,
that would mean that we can't use the syscache to perform the lookups,
which might have unhappy performance consequences.
I do have a patch sitting around that moves the relpartbound into a
new catalogue table named pg_partition. This gets rid of the usage of
pg_inherits for partitioned tables. I wonder if that problem is easier
to solve with that. It also solves the issue with long partition keys
and lack of toast table on pg_class.
Yeah, I thought about that, and it does make some sense. Not sure if
it would hurt performance to have to access another table, but maybe
it comes out in the wash if pg_inherits is gone? Seems like a fair
amount of code rearrangement just to get around the lack of a TOAST
table on pg_class, but maybe it's worth it.
I had another idea, too. I think we might be able to reuse the
technique Noah invented in 4240e429d0c2d889d0cda23c618f94e12c13ade7.
That is:
- make a note of SharedInvalidMessageCounter before doing any of the
relevant catalog lookups
- do them
- AcceptInvalidationMessages()
- if SharedInvalidMessageCounter has changed, discard all the data we
collected and retry from the top
I believe that is sufficient to guarantee that whatever we construct
will have a consistent view of the catalogs which is the most recent
available view as of the time we do the work. And with this approach
I believe we can continue to use syscache lookups to get the data
rather than having to use actual index scans, which is nice.
Then again, with your approach I'm guessing that one index scan would
get us the list of children and their partition bound information.
That would be even better -- the syscache lookup per child goes away
altogether; it's just a question of deforming the pg_partition tuples.
Way back at the beginning of the partitioning work, I mulled over the
idea of storing the partition bound information in a new column in
pg_inherits rather than in pg_class. I wonder why exactly I rejected
that idea, and whether I was wrong to do so. One possible advantage
of that approach over a pg_partition table is that client code
which queries pg_inherits will have to be adjusted if we stop using
it, and some of those queries are going to get more complicated.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Nov 9, 2018 at 9:50 AM Robert Haas <robertmhaas@gmail.com> wrote:
I had another idea, too. I think we might be able to reuse the
technique Noah invented in 4240e429d0c2d889d0cda23c618f94e12c13ade7.
That is:
- make a note of SharedInvalidMessageCounter before doing any of the
relevant catalog lookups
- do them
- AcceptInvalidationMessages()
- if SharedInvalidMessageCounter has changed, discard all the data we
collected and retry from the top
I believe that is sufficient to guarantee that whatever we construct
will have a consistent view of the catalogs which is the most recent
available view as of the time we do the work. And with this approach
I believe we can continue to use syscache lookups to get the data
rather than having to use actual index scans, which is nice.
Here are a couple of patches to illustrate this approach to this part
of the overall problem. 0001 is, I think, a good cleanup that may as
well be applied in isolation; it makes the code in
RelationBuildPartitionDesc both cleaner and more efficient. 0002
adjusts things so that - I hope - the partition bounds we get for the
individual partitions have to be as of the same point in the commit
sequence as the list of children. As I noted before, Alvaro's patch
doesn't seem to have tackled this part of the problem.
Another part of the problem is finding a way to make sure that if we
execute a query (or plan one), all parts of the executor (or planner)
see the same PartitionDesc for every relation. In the case of
parallel query, I think it's important to try to get consistency not
only within a single backend but also across backends. I'm thinking
about perhaps creating an object with a name like
PartitionDescDirectory which can optionally attach to dynamic shared
memory. It would store an OID -> PartitionDesc mapping in local
memory, and optionally, an additional OID -> serialized-PartitionDesc
in DSA. When given an OID, it would check the local hash table first,
and then if that doesn't find anything, check the shared hash table if
there is one. If an entry is found there, deserialize and add to the
local hash table. We'd then hang such a directory off of the EState
for the executor and the PlannerInfo for the planner. As compared
with Alvaro's proposal, this approach has the advantage of not
treating parallel query as a second-class citizen, and also of keeping
partitioning considerations out of the snapshot handling, which as I
said before seems to me to be a good idea.
One thing which was vaguely on my mind in earlier emails but which I
think I can now articulate somewhat more clearly is this: In some
cases, a consistent but outdated view of the catalog state is just as
bad as an inconsistent view of the catalog state. For example, it's
not OK to decide that a tuple can be placed in a certain partition
based on an outdated list of relation constraints, including the
partitioning constraint - nor is it OK to decide that a partition can
be pruned based on an old view of the partitioning constraint. I
think this means that whenever we change a partition's partitioning
constraint, we MUST take AccessExclusiveLock on the partition.
Otherwise, a heap_insert() [or a partition pruning decision] can be in
progress on that relation in one backend at the same time that some
other backend is changing the partition constraint, which can't
possibly work. The best we can reasonably do is to reduce the locking
level on the partitioned table itself.
A corollary is that holding the PartitionDescs that a particular query
sees for a particular relation fixed, whether by the method Alvaro
proposes or by what I am proposing here or by some other method is not
a panacea. For example, the ExecPartitionCheck call in copy.c
sometimes gets skipped on the theory that if tuple routing has sent us
to partition X, then the tuple being routed must satisfy the partition
constraint for that partition. But that's not true if we set up tuple
routing using one view of the catalogs, and then things changed
afterwards. RelationBuildPartitionDesc doesn't lock the children
whose relpartbounds it is fetching (!), so unless we're guaranteed to
have already locked those children earlier for some other reason, we
could grab the partition bound at this point and then it could change
again before we get a lock on them.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
0002-Ensure-that-RelationBuildPartitionDesc-sees-a-consis.patch
From 9570b27545bc896b8dd9215583e922659f343eb4 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Nov 2018 12:15:44 -0500
Subject: [PATCH 2/2] Ensure that RelationBuildPartitionDesc sees a consistent
view.
If partitions are added or removed concurrently, make sure that we
nevertheless get a view of the partition list and the partition
descriptor for each partition which is consistent with the system
state at some single point in the commit history.
To do this, reuse an idea first invented by Noah Misch back in
commit 4240e429d0c2d889d0cda23c618f94e12c13ade7.
---
src/backend/utils/cache/partcache.c | 137 ++++++++++++++++++++++++++----------
1 file changed, 101 insertions(+), 36 deletions(-)
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 25c8b69f3f..6cfe5c8a1b 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -28,8 +28,10 @@
#include "optimizer/clauses.h"
#include "optimizer/planner.h"
#include "partitioning/partbounds.h"
+#include "storage/sinval.h"
#include "utils/builtins.h"
#include "utils/datum.h"
+#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/partcache.h"
@@ -283,45 +285,113 @@ RelationBuildPartitionDesc(Relation rel)
/* Range partitioning specific */
PartitionRangeBound **rbounds = NULL;
- /* Get partition oids from pg_inherits */
- inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
- nparts = list_length(inhoids);
-
- if (nparts > 0)
+ /*
+ * Fetch catalog information. Since we want to allow partitions to be
+ * added and removed without holding AccessExclusiveLock on the parent
+ * table, it's possible that the catalog contents could be changing under
+ * us. That means that by the time we fetch the partition bound for a
+ * partition returned by find_inheritance_children, it might no longer be
+ * a partition or might even be a partition of some other table.
+ *
+ * To ensure that we get a consistent view of the catalog data, we first
+ * fetch everything we need and then call AcceptInvalidationMessages. If
+ * SharedInvalidMessageCounter advances between the time we start fetching
+ * information and the time AcceptInvalidationMessages() completes, that
+ * means something may have changed under us, so we start over and do it
+ * all again.
+ */
+ for (;;)
{
- oids = palloc(nparts * sizeof(Oid));
- boundspecs = palloc(nparts * sizeof(PartitionBoundSpec *));
+ uint64 inval_count = SharedInvalidMessageCounter;
+
+ /* Get partition oids from pg_inherits */
+ inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
+ nparts = list_length(inhoids);
+
+ if (nparts > 0)
+ {
+ oids = palloc(nparts * sizeof(Oid));
+ boundspecs = palloc(nparts * sizeof(PartitionBoundSpec *));
+ }
+
+ /* Collect bound spec nodes for each partition */
+ i = 0;
+ foreach(cell, inhoids)
+ {
+ Oid inhrelid = lfirst_oid(cell);
+ HeapTuple tuple;
+ PartitionBoundSpec *boundspec = NULL;
+
+ /*
+ * Don't put any sanity checks here that might fail as a result of
+ * concurrent DDL, such as a check that relpartbound is not NULL.
+ * We could transiently see such states as a result of concurrent
+ * DDL. Such checks can be performed only after we're sure we got
+ * a consistent view of the underlying data.
+ */
+ tuple = SearchSysCache1(RELOID, inhrelid);
+ if (HeapTupleIsValid(tuple))
+ {
+ Datum datum;
+ bool isnull;
+
+ datum = SysCacheGetAttr(RELOID, tuple,
+ Anum_pg_class_relpartbound,
+ &isnull);
+ if (!isnull)
+ boundspec = stringToNode(TextDatumGetCString(datum));
+ ReleaseSysCache(tuple);
+ }
+
+ oids[i] = inhrelid;
+ boundspecs[i] = boundspec;
+ ++i;
+ }
+
+ /*
+ * If no relevant catalog changes have occurred (see comments at the
+ * top of this loop), then we got a consistent view of our partition
+ * list and can stop now.
+ */
+ AcceptInvalidationMessages();
+ if (inval_count == SharedInvalidMessageCounter)
+ break;
+
+ /* Something changed, so retry from the top. */
+ if (oids != NULL)
+ {
+ pfree(oids);
+ oids = NULL;
+ }
+ if (boundspecs != NULL)
+ {
+ pfree(boundspecs);
+ boundspecs = NULL;
+ }
+ if (inhoids != NIL)
+ list_free(inhoids);
}
- /* Collect bound spec nodes for each partition */
- i = 0;
- foreach(cell, inhoids)
+ /*
+ * At this point, we should have a consistent view of the data we got from
+ * pg_inherits and pg_class, so it's safe to perform some sanity checks.
+ */
+ for (i = 0; i < nparts; ++i)
{
- Oid inhrelid = lfirst_oid(cell);
- HeapTuple tuple;
- Datum datum;
- bool isnull;
- PartitionBoundSpec *boundspec;
-
- tuple = SearchSysCache1(RELOID, inhrelid);
- if (!HeapTupleIsValid(tuple))
- elog(ERROR, "cache lookup failed for relation %u", inhrelid);
-
- datum = SysCacheGetAttr(RELOID, tuple,
- Anum_pg_class_relpartbound,
- &isnull);
- if (isnull)
- elog(ERROR, "null relpartbound for relation %u", inhrelid);
- boundspec = stringToNode(TextDatumGetCString(datum));
- if (!IsA(boundspec, PartitionBoundSpec))
+ Oid inhrelid = oids[i];
+ PartitionBoundSpec *spec = boundspecs[i];
+
+ if (!spec)
+ elog(ERROR, "missing relpartbound for relation %u", inhrelid);
+ if (!IsA(spec, PartitionBoundSpec))
elog(ERROR, "invalid relpartbound for relation %u", inhrelid);
/*
- * Sanity check: If the PartitionBoundSpec says this is the default
- * partition, its OID should correspond to whatever's stored in
- * pg_partitioned_table.partdefid; if not, the catalog is corrupt.
+ * If the PartitionBoundSpec says this is the default partition, its
+ * OID should match pg_partitioned_table.partdefid; if not, the
+ * catalog is corrupt.
*/
- if (boundspec->is_default)
+ if (spec->is_default)
{
Oid partdefid;
@@ -330,11 +400,6 @@ RelationBuildPartitionDesc(Relation rel)
elog(ERROR, "expected partdefid %u, but got %u",
inhrelid, partdefid);
}
-
- oids[i] = inhrelid;
- boundspecs[i] = boundspec;
- ++i;
- ReleaseSysCache(tuple);
}
if (nparts > 0)
--
2.14.3 (Apple Git-98)
0001-Reduce-unnecessary-list-construction-in-RelationBuil.patch
From 6caedf3c86deb1c5bf9b3dc2c333a2ae6b83a3fc Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Nov 2018 11:11:58 -0500
Subject: [PATCH 1/2] Reduce unnecessary list construction in
RelationBuildPartitionDesc.
The 'partoids' list which was constructed by the previous version
of this code was necessarily identical to 'inhoids'. There's no
point to duplicating the list, so avoid that. Instead, construct
the array representation directly from the original 'inhoids' list.
Also, use an array rather than a list for 'boundspecs'. We know
exactly how many items we need to store, so there's really no
reason to use a list. Using an array instead reduces the number
of memory allocations we perform.
---
src/backend/utils/cache/partcache.c | 61 ++++++++++++++++---------------------
1 file changed, 26 insertions(+), 35 deletions(-)
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 5757301d05..25c8b69f3f 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -260,10 +260,9 @@ RelationBuildPartitionKey(Relation relation)
void
RelationBuildPartitionDesc(Relation rel)
{
- List *inhoids,
- *partoids;
+ List *inhoids;
Oid *oids = NULL;
- List *boundspecs = NIL;
+ PartitionBoundSpec **boundspecs = NULL;
ListCell *cell;
int i,
nparts;
@@ -286,17 +285,23 @@ RelationBuildPartitionDesc(Relation rel)
/* Get partition oids from pg_inherits */
inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
+ nparts = list_length(inhoids);
- /* Collect bound spec nodes in a list */
+ if (nparts > 0)
+ {
+ oids = palloc(nparts * sizeof(Oid));
+ boundspecs = palloc(nparts * sizeof(PartitionBoundSpec *));
+ }
+
+ /* Collect bound spec nodes for each partition */
i = 0;
- partoids = NIL;
foreach(cell, inhoids)
{
Oid inhrelid = lfirst_oid(cell);
HeapTuple tuple;
Datum datum;
bool isnull;
- Node *boundspec;
+ PartitionBoundSpec *boundspec;
tuple = SearchSysCache1(RELOID, inhrelid);
if (!HeapTupleIsValid(tuple))
@@ -307,14 +312,16 @@ RelationBuildPartitionDesc(Relation rel)
&isnull);
if (isnull)
elog(ERROR, "null relpartbound for relation %u", inhrelid);
- boundspec = (Node *) stringToNode(TextDatumGetCString(datum));
+ boundspec = stringToNode(TextDatumGetCString(datum));
+ if (!IsA(boundspec, PartitionBoundSpec))
+ elog(ERROR, "invalid relpartbound for relation %u", inhrelid);
/*
* Sanity check: If the PartitionBoundSpec says this is the default
* partition, its OID should correspond to whatever's stored in
* pg_partitioned_table.partdefid; if not, the catalog is corrupt.
*/
- if (castNode(PartitionBoundSpec, boundspec)->is_default)
+ if (boundspec->is_default)
{
Oid partdefid;
@@ -324,20 +331,14 @@ RelationBuildPartitionDesc(Relation rel)
inhrelid, partdefid);
}
- boundspecs = lappend(boundspecs, boundspec);
- partoids = lappend_oid(partoids, inhrelid);
+ oids[i] = inhrelid;
+ boundspecs[i] = boundspec;
+ ++i;
ReleaseSysCache(tuple);
}
- nparts = list_length(partoids);
-
if (nparts > 0)
{
- oids = (Oid *) palloc(nparts * sizeof(Oid));
- i = 0;
- foreach(cell, partoids)
- oids[i++] = lfirst_oid(cell);
-
/* Convert from node to the internal representation */
if (key->strategy == PARTITION_STRATEGY_HASH)
{
@@ -345,11 +346,9 @@ RelationBuildPartitionDesc(Relation rel)
hbounds = (PartitionHashBound **)
palloc(nparts * sizeof(PartitionHashBound *));
- i = 0;
- foreach(cell, boundspecs)
+ for (i = 0; i < nparts; ++i)
{
- PartitionBoundSpec *spec = castNode(PartitionBoundSpec,
- lfirst(cell));
+ PartitionBoundSpec *spec = boundspecs[i];
if (spec->strategy != PARTITION_STRATEGY_HASH)
elog(ERROR, "invalid strategy in partition bound spec");
@@ -360,7 +359,6 @@ RelationBuildPartitionDesc(Relation rel)
hbounds[i]->modulus = spec->modulus;
hbounds[i]->remainder = spec->remainder;
hbounds[i]->index = i;
- i++;
}
/* Sort all the bounds in ascending order */
@@ -374,12 +372,10 @@ RelationBuildPartitionDesc(Relation rel)
/*
* Create a unified list of non-null values across all partitions.
*/
- i = 0;
null_index = -1;
- foreach(cell, boundspecs)
+ for (i = 0; i < nparts; ++i)
{
- PartitionBoundSpec *spec = castNode(PartitionBoundSpec,
- lfirst(cell));
+ PartitionBoundSpec *spec = boundspecs[i];
ListCell *c;
if (spec->strategy != PARTITION_STRATEGY_LIST)
@@ -393,7 +389,6 @@ RelationBuildPartitionDesc(Relation rel)
if (spec->is_default)
{
default_index = i;
- i++;
continue;
}
@@ -425,8 +420,6 @@ RelationBuildPartitionDesc(Relation rel)
non_null_values = lappend(non_null_values,
list_value);
}
-
- i++;
}
ndatums = list_length(non_null_values);
@@ -465,11 +458,10 @@ RelationBuildPartitionDesc(Relation rel)
* Create a unified list of range bounds across all the
* partitions.
*/
- i = ndatums = 0;
- foreach(cell, boundspecs)
+ ndatums = 0;
+ for (i = 0; i < nparts; ++i)
{
- PartitionBoundSpec *spec = castNode(PartitionBoundSpec,
- lfirst(cell));
+ PartitionBoundSpec *spec = boundspecs[i];
PartitionRangeBound *lower,
*upper;
@@ -483,7 +475,7 @@ RelationBuildPartitionDesc(Relation rel)
*/
if (spec->is_default)
{
- default_index = i++;
+ default_index = i;
continue;
}
@@ -493,7 +485,6 @@ RelationBuildPartitionDesc(Relation rel)
false);
all_bounds[ndatums++] = lower;
all_bounds[ndatums++] = upper;
- i++;
}
Assert(ndatums == nparts * 2 ||
--
2.14.3 (Apple Git-98)
On Wed, Nov 14, 2018 at 02:27:31PM -0500, Robert Haas wrote:
Here are a couple of patches to illustrate this approach to this part
of the overall problem. 0001 is, I think, a good cleanup that may as
well be applied in isolation; it makes the code in
RelationBuildPartitionDesc both cleaner and more efficient. 0002
adjust things so that - I hope - the partition bounds we get for the
individual partitions has to be as of the same point in the commit
sequence as the list of children. As I noted before, Alvaro's patch
doesn't seem to have tackled this part of the problem.
You may want to rebase these patches as of b52b7dc2, and change the
first argument of partition_bounds_create() so that a list is used as
input...
--
Michael
On 2018/11/15 4:27, Robert Haas wrote:
RelationBuildPartitionDesc doesn't lock the children
whose relpartbounds it is fetching (!), so unless we're guaranteed to
have already locked them children earlier for some other reason, we
could grab the partition bound at this point and then it could change
again before we get a lock on them.
Hmm, I think that RelationBuildPartitionDesc doesn't need to lock a
partition before fetching its relpartbound, because the latter can't
change if the caller is holding a lock on the parent, which it must be if
we're in RelationBuildPartitionDesc for parent at all. Am I missing
something?
As Michael pointed out, the first cleanup patch needs to be rebased due to
a recent commit [1]. I did that to see if something we did in that commit
made things worse for your patch, but seems fine. I had to go and change
things outside RelationBuildPartitionDesc as I rebased, due to the
aforementioned commit, but they're simple changes such as changing List *
arguments of some newly added functions to PartitionBoundSpec **. Please
find the rebased patches attached with this email.
Thanks,
Amit
[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=b52b7dc2
Attachments:
0001-Reduce-unnecessary-list-construction-in-RelationBuil.patch
From 3e7642c236f04bc65f3f39e7de98d7ff85166e03 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Nov 2018 11:11:58 -0500
Subject: [PATCH 1/2] Reduce unnecessary list construction in
RelationBuildPartitionDesc.
The 'partoids' list which was constructed by the previous version
of this code was necessarily identical to 'inhoids'. There's no
point to duplicating the list, so avoid that. Instead, construct
the array representation directly from the original 'inhoids' list.
Also, use an array rather than a list for 'boundspecs'. We know
exactly how many items we need to store, so there's really no
reason to use a list. Using an array instead reduces the number
of memory allocations we perform.
---
src/backend/partitioning/partbounds.c | 66 +++++++++++++++--------------------
src/backend/utils/cache/partcache.c | 36 +++++++++++--------
src/include/partitioning/partbounds.h | 5 ++-
3 files changed, 51 insertions(+), 56 deletions(-)
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index 0b5e0dd89f..a8f4a1a685 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -70,15 +70,12 @@ static int32 qsort_partition_list_value_cmp(const void *a, const void *b,
void *arg);
static int32 qsort_partition_rbound_cmp(const void *a, const void *b,
void *arg);
-static PartitionBoundInfo create_hash_bounds(List *boundspecs,
- PartitionKey key,
- int **mapping);
-static PartitionBoundInfo create_list_bounds(List *boundspecs,
- PartitionKey key,
- int **mapping);
-static PartitionBoundInfo create_range_bounds(List *boundspecs,
- PartitionKey key,
- int **mapping);
+static PartitionBoundInfo create_hash_bounds(PartitionBoundSpec **boundspecs,
+ int nparts, PartitionKey key, int **mapping);
+static PartitionBoundInfo create_list_bounds(PartitionBoundSpec **boundspecs,
+ int nparts, PartitionKey key, int **mapping);
+static PartitionBoundInfo create_range_bounds(PartitionBoundSpec **boundspecs,
+ int nparts, PartitionKey key, int **mapping);
static PartitionRangeBound *make_one_partition_rbound(PartitionKey key, int index,
List *datums, bool lower);
static int32 partition_hbound_cmp(int modulus1, int remainder1, int modulus2,
@@ -169,9 +166,9 @@ get_qual_from_partbound(Relation rel, Relation parent,
* current memory context.
*/
PartitionBoundInfo
-partition_bounds_create(List *boundspecs, PartitionKey key, int **mapping)
+partition_bounds_create(PartitionBoundSpec **boundspecs, int nparts,
+ PartitionKey key, int **mapping)
{
- int nparts = list_length(boundspecs);
int i;
Assert(nparts > 0);
@@ -199,13 +196,13 @@ partition_bounds_create(List *boundspecs, PartitionKey key, int **mapping)
switch (key->strategy)
{
case PARTITION_STRATEGY_HASH:
- return create_hash_bounds(boundspecs, key, mapping);
+ return create_hash_bounds(boundspecs, nparts, key, mapping);
case PARTITION_STRATEGY_LIST:
- return create_list_bounds(boundspecs, key, mapping);
+ return create_list_bounds(boundspecs, nparts, key, mapping);
case PARTITION_STRATEGY_RANGE:
- return create_range_bounds(boundspecs, key, mapping);
+ return create_range_bounds(boundspecs, nparts, key, mapping);
default:
elog(ERROR, "unexpected partition strategy: %d",
@@ -222,13 +219,12 @@ partition_bounds_create(List *boundspecs, PartitionKey key, int **mapping)
* Create a PartitionBoundInfo for a hash partitioned table
*/
static PartitionBoundInfo
-create_hash_bounds(List *boundspecs, PartitionKey key, int **mapping)
+create_hash_bounds(PartitionBoundSpec **boundspecs, int nparts,
+ PartitionKey key, int **mapping)
{
PartitionBoundInfo boundinfo;
PartitionHashBound **hbounds = NULL;
- ListCell *cell;
- int i,
- nparts = list_length(boundspecs);
+ int i;
int ndatums = 0;
int greatest_modulus;
@@ -244,10 +240,9 @@ create_hash_bounds(List *boundspecs, PartitionKey key, int **mapping)
palloc(nparts * sizeof(PartitionHashBound *));
/* Convert from node to the internal representation */
- i = 0;
- foreach(cell, boundspecs)
+ for (i = 0; i < nparts; i++)
{
- PartitionBoundSpec *spec = castNode(PartitionBoundSpec, lfirst(cell));
+ PartitionBoundSpec *spec = boundspecs[i];
if (spec->strategy != PARTITION_STRATEGY_HASH)
elog(ERROR, "invalid strategy in partition bound spec");
@@ -256,7 +251,6 @@ create_hash_bounds(List *boundspecs, PartitionKey key, int **mapping)
hbounds[i]->modulus = spec->modulus;
hbounds[i]->remainder = spec->remainder;
hbounds[i]->index = i;
- i++;
}
/* Sort all the bounds in ascending order */
@@ -307,7 +301,8 @@ create_hash_bounds(List *boundspecs, PartitionKey key, int **mapping)
* Create a PartitionBoundInfo for a list partitioned table
*/
static PartitionBoundInfo
-create_list_bounds(List *boundspecs, PartitionKey key, int **mapping)
+create_list_bounds(PartitionBoundSpec **boundspecs, int nparts,
+ PartitionKey key, int **mapping)
{
PartitionBoundInfo boundinfo;
PartitionListValue **all_values = NULL;
@@ -327,9 +322,9 @@ create_list_bounds(List *boundspecs, PartitionKey key, int **mapping)
boundinfo->default_index = -1;
/* Create a unified list of non-null values across all partitions. */
- foreach(cell, boundspecs)
+ for (i = 0; i < nparts; i++)
{
- PartitionBoundSpec *spec = castNode(PartitionBoundSpec, lfirst(cell));
+ PartitionBoundSpec *spec = boundspecs[i];
ListCell *c;
if (spec->strategy != PARTITION_STRATEGY_LIST)
@@ -343,7 +338,6 @@ create_list_bounds(List *boundspecs, PartitionKey key, int **mapping)
if (spec->is_default)
{
default_index = i;
- i++;
continue;
}
@@ -374,8 +368,6 @@ create_list_bounds(List *boundspecs, PartitionKey key, int **mapping)
if (list_value)
non_null_values = lappend(non_null_values, list_value);
}
-
- i++;
}
ndatums = list_length(non_null_values);
@@ -458,7 +450,7 @@ create_list_bounds(List *boundspecs, PartitionKey key, int **mapping)
}
/* All partition must now have been assigned canonical indexes. */
- Assert(next_index == list_length(boundspecs));
+ Assert(next_index == nparts);
return boundinfo;
}
@@ -467,16 +459,15 @@ create_list_bounds(List *boundspecs, PartitionKey key, int **mapping)
* Create a PartitionBoundInfo for a range partitioned table
*/
static PartitionBoundInfo
-create_range_bounds(List *boundspecs, PartitionKey key, int **mapping)
+create_range_bounds(PartitionBoundSpec **boundspecs, int nparts,
+ PartitionKey key, int **mapping)
{
PartitionBoundInfo boundinfo;
PartitionRangeBound **rbounds = NULL;
PartitionRangeBound **all_bounds,
*prev;
- ListCell *cell;
int i,
- k,
- nparts = list_length(boundspecs);
+ k;
int ndatums = 0;
int default_index = -1;
int next_index = 0;
@@ -493,10 +484,10 @@ create_range_bounds(List *boundspecs, PartitionKey key, int **mapping)
palloc0(2 * nparts * sizeof(PartitionRangeBound *));
/* Create a unified list of range bounds across all the partitions. */
- i = ndatums = 0;
- foreach(cell, boundspecs)
+ ndatums = 0;
+ for (i = 0; i < nparts; i++)
{
- PartitionBoundSpec *spec = castNode(PartitionBoundSpec, lfirst(cell));
+ PartitionBoundSpec *spec = boundspecs[i];
PartitionRangeBound *lower,
*upper;
@@ -510,7 +501,7 @@ create_range_bounds(List *boundspecs, PartitionKey key, int **mapping)
*/
if (spec->is_default)
{
- default_index = i++;
+ default_index = i;
continue;
}
@@ -518,7 +509,6 @@ create_range_bounds(List *boundspecs, PartitionKey key, int **mapping)
upper = make_one_partition_rbound(key, i, spec->upperdatums, false);
all_bounds[ndatums++] = lower;
all_bounds[ndatums++] = upper;
- i++;
}
Assert(ndatums == nparts * 2 ||
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 07653f312b..0d732b4b84 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -257,26 +257,34 @@ RelationBuildPartitionDesc(Relation rel)
PartitionDesc partdesc;
PartitionBoundInfo boundinfo;
List *inhoids;
- List *boundspecs = NIL;
+ PartitionBoundSpec **boundspecs = NULL;
+ Oid *oids;
ListCell *cell;
int i,
nparts;
PartitionKey key = RelationGetPartitionKey(rel);
MemoryContext oldcxt;
- Oid *oids_orig;
int *mapping;
/* Get partition oids from pg_inherits */
inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
+ nparts = list_length(inhoids);
- /* Collect bound spec nodes in a list */
+ if (nparts > 0)
+ {
+ oids = palloc(nparts * sizeof(Oid));
+ boundspecs = palloc(nparts * sizeof(PartitionBoundSpec *));
+ }
+
+ /* Collect bound spec nodes for each partition */
+ i = 0;
foreach(cell, inhoids)
{
Oid inhrelid = lfirst_oid(cell);
HeapTuple tuple;
Datum datum;
bool isnull;
- Node *boundspec;
+ PartitionBoundSpec *boundspec;
tuple = SearchSysCache1(RELOID, inhrelid);
if (!HeapTupleIsValid(tuple))
@@ -287,14 +295,16 @@ RelationBuildPartitionDesc(Relation rel)
&isnull);
if (isnull)
elog(ERROR, "null relpartbound for relation %u", inhrelid);
- boundspec = (Node *) stringToNode(TextDatumGetCString(datum));
+ boundspec = stringToNode(TextDatumGetCString(datum));
+ if (!IsA(boundspec, PartitionBoundSpec))
+ elog(ERROR, "invalid relpartbound for relation %u", inhrelid);
/*
* Sanity check: If the PartitionBoundSpec says this is the default
* partition, its OID should correspond to whatever's stored in
* pg_partitioned_table.partdefid; if not, the catalog is corrupt.
*/
- if (castNode(PartitionBoundSpec, boundspec)->is_default)
+ if (boundspec->is_default)
{
Oid partdefid;
@@ -304,12 +314,12 @@ RelationBuildPartitionDesc(Relation rel)
inhrelid, partdefid);
}
- boundspecs = lappend(boundspecs, boundspec);
+ oids[i] = inhrelid;
+ boundspecs[i] = boundspec;
+ ++i;
ReleaseSysCache(tuple);
}
- nparts = list_length(boundspecs);
-
/* Now build the actual relcache partition descriptor */
rel->rd_pdcxt = AllocSetContextCreate(CacheMemoryContext,
"partition descriptor",
@@ -330,11 +340,7 @@ RelationBuildPartitionDesc(Relation rel)
}
/* First create PartitionBoundInfo */
- boundinfo = partition_bounds_create(boundspecs, key, &mapping);
- oids_orig = (Oid *) palloc(sizeof(Oid) * partdesc->nparts);
- i = 0;
- foreach(cell, inhoids)
- oids_orig[i++] = lfirst_oid(cell);
+ boundinfo = partition_bounds_create(boundspecs, nparts, key, &mapping);
/* Now copy boundinfo and oids into partdesc. */
oldcxt = MemoryContextSwitchTo(rel->rd_pdcxt);
@@ -348,7 +354,7 @@ RelationBuildPartitionDesc(Relation rel)
* canonicalized representation of the partition bounds.
*/
for (i = 0; i < partdesc->nparts; i++)
- partdesc->oids[mapping[i]] = oids_orig[i];
+ partdesc->oids[mapping[i]] = oids[i];
MemoryContextSwitchTo(oldcxt);
rel->rd_partdesc = partdesc;
diff --git a/src/include/partitioning/partbounds.h b/src/include/partitioning/partbounds.h
index 7a697d1c0a..36fb584e23 100644
--- a/src/include/partitioning/partbounds.h
+++ b/src/include/partitioning/partbounds.h
@@ -80,9 +80,8 @@ extern uint64 compute_partition_hash_value(int partnatts, FmgrInfo *partsupfunc,
Datum *values, bool *isnull);
extern List *get_qual_from_partbound(Relation rel, Relation parent,
PartitionBoundSpec *spec);
-extern PartitionBoundInfo partition_bounds_create(List *boundspecs,
- PartitionKey key,
- int **mapping);
+extern PartitionBoundInfo partition_bounds_create(PartitionBoundSpec **boundspecs,
+ int nparts, PartitionKey key, int **mapping);
extern bool partition_bounds_equal(int partnatts, int16 *parttyplen,
bool *parttypbyval, PartitionBoundInfo b1,
PartitionBoundInfo b2);
--
2.11.0
0002-Ensure-that-RelationBuildPartitionDesc-sees-a-consis.patch (text/plain)
From eb5682bde2b070bac09f69b817b2be4dc73e3332 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Nov 2018 12:15:44 -0500
Subject: [PATCH 2/2] Ensure that RelationBuildPartitionDesc sees a consistent
view.
If partitions are added or removed concurrently, make sure that we
nevertheless get a view of the partition list and the partition
descriptor for each partition which is consistent with the system
state at some single point in the commit history.
To do this, reuse an idea first invented by Noah Misch back in
commit 4240e429d0c2d889d0cda23c618f94e12c13ade7.
---
src/backend/utils/cache/partcache.c | 135 ++++++++++++++++++++++++++----------
1 file changed, 100 insertions(+), 35 deletions(-)
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 0d732b4b84..623b156dd4 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -28,8 +28,10 @@
#include "optimizer/clauses.h"
#include "optimizer/planner.h"
#include "partitioning/partbounds.h"
+#include "storage/sinval.h"
#include "utils/builtins.h"
#include "utils/datum.h"
+#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/partcache.h"
@@ -266,45 +268,113 @@ RelationBuildPartitionDesc(Relation rel)
MemoryContext oldcxt;
int *mapping;
- /* Get partition oids from pg_inherits */
- inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
- nparts = list_length(inhoids);
-
- if (nparts > 0)
+ /*
+ * Fetch catalog information. Since we want to allow partitions to be
+ * added and removed without holding AccessExclusiveLock on the parent
+ * table, it's possible that the catalog contents could be changing under
+ * us. That means that by the time we fetch the partition bound for a
+ * partition returned by find_inheritance_children, it might no longer be
+ * a partition or might even be a partition of some other table.
+ *
+ * To ensure that we get a consistent view of the catalog data, we first
+ * fetch everything we need and then call AcceptInvalidationMessages. If
+ * SharedInvalidMessageCounter advances between the time we start fetching
+ * information and the time AcceptInvalidationMessages() completes, that
+ * means something may have changed under us, so we start over and do it
+ * all again.
+ */
+ for (;;)
{
- oids = palloc(nparts * sizeof(Oid));
- boundspecs = palloc(nparts * sizeof(PartitionBoundSpec *));
+ uint64 inval_count = SharedInvalidMessageCounter;
+
+ /* Get partition oids from pg_inherits */
+ inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
+ nparts = list_length(inhoids);
+
+ if (nparts > 0)
+ {
+ oids = palloc(nparts * sizeof(Oid));
+ boundspecs = palloc(nparts * sizeof(PartitionBoundSpec *));
+ }
+
+ /* Collect bound spec nodes for each partition */
+ i = 0;
+ foreach(cell, inhoids)
+ {
+ Oid inhrelid = lfirst_oid(cell);
+ HeapTuple tuple;
+ PartitionBoundSpec *boundspec = NULL;
+
+ /*
+ * Don't put any sanity checks here that might fail as a result of
+ * concurrent DDL, such as a check that relpartbound is not NULL.
+ * We could transiently see such states as a result of concurrent
+ * DDL. Such checks can be performed only after we're sure we got
+ * a consistent view of the underlying data.
+ */
+ tuple = SearchSysCache1(RELOID, inhrelid);
+ if (HeapTupleIsValid(tuple))
+ {
+ Datum datum;
+ bool isnull;
+
+ datum = SysCacheGetAttr(RELOID, tuple,
+ Anum_pg_class_relpartbound,
+ &isnull);
+ if (!isnull)
+ boundspec = stringToNode(TextDatumGetCString(datum));
+ ReleaseSysCache(tuple);
+ }
+
+ oids[i] = inhrelid;
+ boundspecs[i] = boundspec;
+ ++i;
+ }
+
+ /*
+ * If no relevant catalog changes have occurred (see comments at the
+ * top of this loop), then we got a consistent view of our partition
+ * list and can stop now.
+ */
+ AcceptInvalidationMessages();
+ if (inval_count == SharedInvalidMessageCounter)
+ break;
+
+ /* Something changed, so retry from the top. */
+ if (oids != NULL)
+ {
+ pfree(oids);
+ oids = NULL;
+ }
+ if (boundspecs != NULL)
+ {
+ pfree(boundspecs);
+ boundspecs = NULL;
+ }
+ if (inhoids != NIL)
+ list_free(inhoids);
}
- /* Collect bound spec nodes for each partition */
- i = 0;
- foreach(cell, inhoids)
+ /*
+ * At this point, we should have a consistent view of the data we got from
+ * pg_inherits and pg_class, so it's safe to perform some sanity checks.
+ */
+ for (i = 0; i < nparts; ++i)
{
- Oid inhrelid = lfirst_oid(cell);
- HeapTuple tuple;
- Datum datum;
- bool isnull;
- PartitionBoundSpec *boundspec;
+ Oid inhrelid = oids[i];
+ PartitionBoundSpec *spec = boundspecs[i];
- tuple = SearchSysCache1(RELOID, inhrelid);
- if (!HeapTupleIsValid(tuple))
- elog(ERROR, "cache lookup failed for relation %u", inhrelid);
-
- datum = SysCacheGetAttr(RELOID, tuple,
- Anum_pg_class_relpartbound,
- &isnull);
- if (isnull)
- elog(ERROR, "null relpartbound for relation %u", inhrelid);
- boundspec = stringToNode(TextDatumGetCString(datum));
- if (!IsA(boundspec, PartitionBoundSpec))
+ if (!spec)
+ elog(ERROR, "missing relpartbound for relation %u", inhrelid);
+ if (!IsA(spec, PartitionBoundSpec))
elog(ERROR, "invalid relpartbound for relation %u", inhrelid);
/*
- * Sanity check: If the PartitionBoundSpec says this is the default
- * partition, its OID should correspond to whatever's stored in
- * pg_partitioned_table.partdefid; if not, the catalog is corrupt.
+ * If the PartitionBoundSpec says this is the default partition, its
+ * OID should match pg_partitioned_table.partdefid; if not, the
+ * catalog is corrupt.
*/
- if (boundspec->is_default)
+ if (spec->is_default)
{
Oid partdefid;
@@ -313,11 +383,6 @@ RelationBuildPartitionDesc(Relation rel)
elog(ERROR, "expected partdefid %u, but got %u",
inhrelid, partdefid);
}
-
- oids[i] = inhrelid;
- boundspecs[i] = boundspec;
- ++i;
- ReleaseSysCache(tuple);
}
/* Now build the actual relcache partition descriptor */
--
2.11.0
On 2018/11/15 11:03, Amit Langote wrote:
As Michael pointed out, the first cleanup patch needs to be rebased due to
a recent commit [1]. I did that to see if something we did in that commit
made things worse for your patch, but seems fine. I had to go and change
things outside RelationBuildPartitionDesc as I rebased, due to the
aforementioned commit, but they're simple changes such as changing List *
arguments of some newly added functions to PartitionBoundSpec **. Please
find the rebased patches attached with this email.
[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=b52b7dc2
I noticed that the regression tests containing partitioned tables fail
randomly with the rebased patches I posted, whereas they didn't if I apply
them to HEAD without [1].
It seems to be due to the slightly confused memory context handling in
RelationBuildPartitionDesc after [1], which Alvaro had expressed some
doubts about yesterday [2].
I've fixed 0001 again to re-order the code so that allocations happen in
the correct context, and now the tests pass with the rebased patches.
By the way, I noticed that the oids array added by Robert's original 0001
patch wasn't initialized to NULL, which could lead to calling pfree on a
garbage value of oids after the 2nd patch.
Thanks,
Amit
[2]: /messages/by-id/20181113135915.v4r77tdthlajdlqq@alvherre.pgsql
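The retry pattern that the 0002 patch builds around SharedInvalidMessageCounter, and the NULL-initialization issue noted above, can be sketched outside the backend. Everything here is simulated: the counter, the catalog scan, and the OIDs are illustrative stand-ins, and the forced counter bump on the first pass plays the role of a concurrent ATTACH/DETACH:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

static uint64_t fake_inval_counter = 0;  /* stands in for SharedInvalidMessageCounter */
static int scans_done = 0;

/* Simulated catalog scan; on the first pass a "concurrent DDL" bumps
 * the invalidation counter, which should force one retry. */
static void
scan_catalog(int **oids, int *nparts)
{
    scans_done++;
    if (scans_done == 1)
        fake_inval_counter++;
    *nparts = 2;
    *oids = malloc(*nparts * sizeof(int));
    (*oids)[0] = 16384;
    (*oids)[1] = 16385;
}

static int
build_partition_list(int **oids_out, int *nparts_out)
{
    int *oids = NULL;   /* must start NULL so the retry path can free safely */
    int  nparts = 0;

    for (;;)
    {
        uint64_t inval_count = fake_inval_counter;

        scan_catalog(&oids, &nparts);

        /* AcceptInvalidationMessages() would run here in the backend; if
         * the counter moved while we were scanning, retry from the top. */
        if (inval_count == fake_inval_counter)
            break;

        if (oids != NULL)
        {
            free(oids);
            oids = NULL;
        }
    }

    *oids_out = oids;
    *nparts_out = nparts;
    return scans_done;
}
```

The comment on the oids declaration is the point Amit raises: without the explicit NULL initialization, a retry after an empty first scan would call free (pfree in the backend) on a garbage pointer.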
Attachments:
0001-Reduce-unnecessary-list-construction-in-RelationBuil.patch (text/plain)
From bff9ca17e43d6538c5e01e5de7a95f6e426e0d55 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Nov 2018 11:11:58 -0500
Subject: [PATCH 1/2] Reduce unnecessary list construction in
RelationBuildPartitionDesc.
The 'partoids' list which was constructed by the previous version
of this code was necessarily identical to 'inhoids'. There's no
point to duplicating the list, so avoid that. Instead, construct
the array representation directly from the original 'inhoids' list.
Also, use an array rather than a list for 'boundspecs'. We know
exactly how many items we need to store, so there's really no
reason to use a list. Using an array instead reduces the number
of memory allocations we perform.
---
src/backend/partitioning/partbounds.c | 66 ++++++++++++++-----------------
src/backend/utils/cache/partcache.c | 73 ++++++++++++++++++-----------------
src/include/partitioning/partbounds.h | 5 +--
3 files changed, 67 insertions(+), 77 deletions(-)
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index 0b5e0dd89f..a8f4a1a685 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -70,15 +70,12 @@ static int32 qsort_partition_list_value_cmp(const void *a, const void *b,
void *arg);
static int32 qsort_partition_rbound_cmp(const void *a, const void *b,
void *arg);
-static PartitionBoundInfo create_hash_bounds(List *boundspecs,
- PartitionKey key,
- int **mapping);
-static PartitionBoundInfo create_list_bounds(List *boundspecs,
- PartitionKey key,
- int **mapping);
-static PartitionBoundInfo create_range_bounds(List *boundspecs,
- PartitionKey key,
- int **mapping);
+static PartitionBoundInfo create_hash_bounds(PartitionBoundSpec **boundspecs,
+ int nparts, PartitionKey key, int **mapping);
+static PartitionBoundInfo create_list_bounds(PartitionBoundSpec **boundspecs,
+ int nparts, PartitionKey key, int **mapping);
+static PartitionBoundInfo create_range_bounds(PartitionBoundSpec **boundspecs,
+ int nparts, PartitionKey key, int **mapping);
static PartitionRangeBound *make_one_partition_rbound(PartitionKey key, int index,
List *datums, bool lower);
static int32 partition_hbound_cmp(int modulus1, int remainder1, int modulus2,
@@ -169,9 +166,9 @@ get_qual_from_partbound(Relation rel, Relation parent,
* current memory context.
*/
PartitionBoundInfo
-partition_bounds_create(List *boundspecs, PartitionKey key, int **mapping)
+partition_bounds_create(PartitionBoundSpec **boundspecs, int nparts,
+ PartitionKey key, int **mapping)
{
- int nparts = list_length(boundspecs);
int i;
Assert(nparts > 0);
@@ -199,13 +196,13 @@ partition_bounds_create(List *boundspecs, PartitionKey key, int **mapping)
switch (key->strategy)
{
case PARTITION_STRATEGY_HASH:
- return create_hash_bounds(boundspecs, key, mapping);
+ return create_hash_bounds(boundspecs, nparts, key, mapping);
case PARTITION_STRATEGY_LIST:
- return create_list_bounds(boundspecs, key, mapping);
+ return create_list_bounds(boundspecs, nparts, key, mapping);
case PARTITION_STRATEGY_RANGE:
- return create_range_bounds(boundspecs, key, mapping);
+ return create_range_bounds(boundspecs, nparts, key, mapping);
default:
elog(ERROR, "unexpected partition strategy: %d",
@@ -222,13 +219,12 @@ partition_bounds_create(List *boundspecs, PartitionKey key, int **mapping)
* Create a PartitionBoundInfo for a hash partitioned table
*/
static PartitionBoundInfo
-create_hash_bounds(List *boundspecs, PartitionKey key, int **mapping)
+create_hash_bounds(PartitionBoundSpec **boundspecs, int nparts,
+ PartitionKey key, int **mapping)
{
PartitionBoundInfo boundinfo;
PartitionHashBound **hbounds = NULL;
- ListCell *cell;
- int i,
- nparts = list_length(boundspecs);
+ int i;
int ndatums = 0;
int greatest_modulus;
@@ -244,10 +240,9 @@ create_hash_bounds(List *boundspecs, PartitionKey key, int **mapping)
palloc(nparts * sizeof(PartitionHashBound *));
/* Convert from node to the internal representation */
- i = 0;
- foreach(cell, boundspecs)
+ for (i = 0; i < nparts; i++)
{
- PartitionBoundSpec *spec = castNode(PartitionBoundSpec, lfirst(cell));
+ PartitionBoundSpec *spec = boundspecs[i];
if (spec->strategy != PARTITION_STRATEGY_HASH)
elog(ERROR, "invalid strategy in partition bound spec");
@@ -256,7 +251,6 @@ create_hash_bounds(List *boundspecs, PartitionKey key, int **mapping)
hbounds[i]->modulus = spec->modulus;
hbounds[i]->remainder = spec->remainder;
hbounds[i]->index = i;
- i++;
}
/* Sort all the bounds in ascending order */
@@ -307,7 +301,8 @@ create_hash_bounds(List *boundspecs, PartitionKey key, int **mapping)
* Create a PartitionBoundInfo for a list partitioned table
*/
static PartitionBoundInfo
-create_list_bounds(List *boundspecs, PartitionKey key, int **mapping)
+create_list_bounds(PartitionBoundSpec **boundspecs, int nparts,
+ PartitionKey key, int **mapping)
{
PartitionBoundInfo boundinfo;
PartitionListValue **all_values = NULL;
@@ -327,9 +322,9 @@ create_list_bounds(List *boundspecs, PartitionKey key, int **mapping)
boundinfo->default_index = -1;
/* Create a unified list of non-null values across all partitions. */
- foreach(cell, boundspecs)
+ for (i = 0; i < nparts; i++)
{
- PartitionBoundSpec *spec = castNode(PartitionBoundSpec, lfirst(cell));
+ PartitionBoundSpec *spec = boundspecs[i];
ListCell *c;
if (spec->strategy != PARTITION_STRATEGY_LIST)
@@ -343,7 +338,6 @@ create_list_bounds(List *boundspecs, PartitionKey key, int **mapping)
if (spec->is_default)
{
default_index = i;
- i++;
continue;
}
@@ -374,8 +368,6 @@ create_list_bounds(List *boundspecs, PartitionKey key, int **mapping)
if (list_value)
non_null_values = lappend(non_null_values, list_value);
}
-
- i++;
}
ndatums = list_length(non_null_values);
@@ -458,7 +450,7 @@ create_list_bounds(List *boundspecs, PartitionKey key, int **mapping)
}
/* All partition must now have been assigned canonical indexes. */
- Assert(next_index == list_length(boundspecs));
+ Assert(next_index == nparts);
return boundinfo;
}
@@ -467,16 +459,15 @@ create_list_bounds(List *boundspecs, PartitionKey key, int **mapping)
* Create a PartitionBoundInfo for a range partitioned table
*/
static PartitionBoundInfo
-create_range_bounds(List *boundspecs, PartitionKey key, int **mapping)
+create_range_bounds(PartitionBoundSpec **boundspecs, int nparts,
+ PartitionKey key, int **mapping)
{
PartitionBoundInfo boundinfo;
PartitionRangeBound **rbounds = NULL;
PartitionRangeBound **all_bounds,
*prev;
- ListCell *cell;
int i,
- k,
- nparts = list_length(boundspecs);
+ k;
int ndatums = 0;
int default_index = -1;
int next_index = 0;
@@ -493,10 +484,10 @@ create_range_bounds(List *boundspecs, PartitionKey key, int **mapping)
palloc0(2 * nparts * sizeof(PartitionRangeBound *));
/* Create a unified list of range bounds across all the partitions. */
- i = ndatums = 0;
- foreach(cell, boundspecs)
+ ndatums = 0;
+ for (i = 0; i < nparts; i++)
{
- PartitionBoundSpec *spec = castNode(PartitionBoundSpec, lfirst(cell));
+ PartitionBoundSpec *spec = boundspecs[i];
PartitionRangeBound *lower,
*upper;
@@ -510,7 +501,7 @@ create_range_bounds(List *boundspecs, PartitionKey key, int **mapping)
*/
if (spec->is_default)
{
- default_index = i++;
+ default_index = i;
continue;
}
@@ -518,7 +509,6 @@ create_range_bounds(List *boundspecs, PartitionKey key, int **mapping)
upper = make_one_partition_rbound(key, i, spec->upperdatums, false);
all_bounds[ndatums++] = lower;
all_bounds[ndatums++] = upper;
- i++;
}
Assert(ndatums == nparts * 2 ||
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 07653f312b..729b887442 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -255,28 +255,36 @@ void
RelationBuildPartitionDesc(Relation rel)
{
PartitionDesc partdesc;
- PartitionBoundInfo boundinfo;
+ PartitionBoundInfo boundinfo = NULL;
List *inhoids;
- List *boundspecs = NIL;
+ PartitionBoundSpec **boundspecs = NULL;
+ Oid *oids = NULL;
ListCell *cell;
int i,
nparts;
PartitionKey key = RelationGetPartitionKey(rel);
MemoryContext oldcxt;
- Oid *oids_orig;
int *mapping;
/* Get partition oids from pg_inherits */
inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
+ nparts = list_length(inhoids);
- /* Collect bound spec nodes in a list */
+ if (nparts > 0)
+ {
+ oids = palloc(nparts * sizeof(Oid));
+ boundspecs = palloc(nparts * sizeof(PartitionBoundSpec *));
+ }
+
+ /* Collect bound spec nodes for each partition */
+ i = 0;
foreach(cell, inhoids)
{
Oid inhrelid = lfirst_oid(cell);
HeapTuple tuple;
Datum datum;
bool isnull;
- Node *boundspec;
+ PartitionBoundSpec *boundspec;
tuple = SearchSysCache1(RELOID, inhrelid);
if (!HeapTupleIsValid(tuple))
@@ -287,14 +295,16 @@ RelationBuildPartitionDesc(Relation rel)
&isnull);
if (isnull)
elog(ERROR, "null relpartbound for relation %u", inhrelid);
- boundspec = (Node *) stringToNode(TextDatumGetCString(datum));
+ boundspec = stringToNode(TextDatumGetCString(datum));
+ if (!IsA(boundspec, PartitionBoundSpec))
+ elog(ERROR, "invalid relpartbound for relation %u", inhrelid);
/*
* Sanity check: If the PartitionBoundSpec says this is the default
* partition, its OID should correspond to whatever's stored in
* pg_partitioned_table.partdefid; if not, the catalog is corrupt.
*/
- if (castNode(PartitionBoundSpec, boundspec)->is_default)
+ if (boundspec->is_default)
{
Oid partdefid;
@@ -304,11 +314,16 @@ RelationBuildPartitionDesc(Relation rel)
inhrelid, partdefid);
}
- boundspecs = lappend(boundspecs, boundspec);
+ oids[i] = inhrelid;
+ boundspecs[i] = boundspec;
+ ++i;
ReleaseSysCache(tuple);
}
- nparts = list_length(boundspecs);
+ /* First create PartitionBoundInfo */
+ if (nparts > 0)
+ boundinfo = partition_bounds_create(boundspecs, nparts, key,
+ &mapping);
/* Now build the actual relcache partition descriptor */
rel->rd_pdcxt = AllocSetContextCreate(CacheMemoryContext,
@@ -316,39 +331,25 @@ RelationBuildPartitionDesc(Relation rel)
ALLOCSET_DEFAULT_SIZES);
MemoryContextCopyAndSetIdentifier(rel->rd_pdcxt, RelationGetRelationName(rel));
+ /* Make a copy of oids and boundinfo in the cache context. */
oldcxt = MemoryContextSwitchTo(rel->rd_pdcxt);
partdesc = (PartitionDescData *) palloc0(sizeof(PartitionDescData));
partdesc->nparts = nparts;
- /* oids and boundinfo are allocated below. */
-
- MemoryContextSwitchTo(oldcxt);
-
- if (nparts == 0)
+ if (nparts > 0)
{
- rel->rd_partdesc = partdesc;
- return;
+ partdesc->boundinfo = partition_bounds_copy(boundinfo, key);
+ partdesc->oids = (Oid *) palloc(partdesc->nparts * sizeof(Oid));
+
+ /*
+ * Now assign OIDs from the original array into mapped indexes of the
+ * result array. Order of OIDs in the former is defined by the
+ * catalog scan that retrieved them, whereas that in the latter is
+ * defined by canonicalized representation of the partition bounds.
+ */
+ for (i = 0; i < partdesc->nparts; i++)
+ partdesc->oids[mapping[i]] = oids[i];
}
- /* First create PartitionBoundInfo */
- boundinfo = partition_bounds_create(boundspecs, key, &mapping);
- oids_orig = (Oid *) palloc(sizeof(Oid) * partdesc->nparts);
- i = 0;
- foreach(cell, inhoids)
- oids_orig[i++] = lfirst_oid(cell);
-
- /* Now copy boundinfo and oids into partdesc. */
- oldcxt = MemoryContextSwitchTo(rel->rd_pdcxt);
- partdesc->boundinfo = partition_bounds_copy(boundinfo, key);
- partdesc->oids = (Oid *) palloc(partdesc->nparts * sizeof(Oid));
-
- /*
- * Now assign OIDs from the original array into mapped indexes of the
- * result array. Order of OIDs in the former is defined by the catalog
- * scan that retrieved them, whereas that in the latter is defined by
- * canonicalized representation of the partition bounds.
- */
- for (i = 0; i < partdesc->nparts; i++)
- partdesc->oids[mapping[i]] = oids_orig[i];
MemoryContextSwitchTo(oldcxt);
rel->rd_partdesc = partdesc;
diff --git a/src/include/partitioning/partbounds.h b/src/include/partitioning/partbounds.h
index 7a697d1c0a..36fb584e23 100644
--- a/src/include/partitioning/partbounds.h
+++ b/src/include/partitioning/partbounds.h
@@ -80,9 +80,8 @@ extern uint64 compute_partition_hash_value(int partnatts, FmgrInfo *partsupfunc,
Datum *values, bool *isnull);
extern List *get_qual_from_partbound(Relation rel, Relation parent,
PartitionBoundSpec *spec);
-extern PartitionBoundInfo partition_bounds_create(List *boundspecs,
- PartitionKey key,
- int **mapping);
+extern PartitionBoundInfo partition_bounds_create(PartitionBoundSpec **boundspecs,
+ int nparts, PartitionKey key, int **mapping);
extern bool partition_bounds_equal(int partnatts, int16 *parttyplen,
bool *parttypbyval, PartitionBoundInfo b1,
PartitionBoundInfo b2);
--
2.11.0
0002-Ensure-that-RelationBuildPartitionDesc-sees-a-consis.patch (text/plain)
From fb8ac535b92b62454224356e5fa4892c5b16f327 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Nov 2018 12:15:44 -0500
Subject: [PATCH 2/2] Ensure that RelationBuildPartitionDesc sees a consistent
view.
If partitions are added or removed concurrently, make sure that we
nevertheless get a view of the partition list and the partition
descriptor for each partition which is consistent with the system
state at some single point in the commit history.
To do this, reuse an idea first invented by Noah Misch back in
commit 4240e429d0c2d889d0cda23c618f94e12c13ade7.
---
src/backend/utils/cache/partcache.c | 135 ++++++++++++++++++++++++++----------
1 file changed, 100 insertions(+), 35 deletions(-)
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 729b887442..91d56d9dfa 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -28,8 +28,10 @@
#include "optimizer/clauses.h"
#include "optimizer/planner.h"
#include "partitioning/partbounds.h"
+#include "storage/sinval.h"
#include "utils/builtins.h"
#include "utils/datum.h"
+#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/partcache.h"
@@ -266,45 +268,113 @@ RelationBuildPartitionDesc(Relation rel)
MemoryContext oldcxt;
int *mapping;
- /* Get partition oids from pg_inherits */
- inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
- nparts = list_length(inhoids);
-
- if (nparts > 0)
+ /*
+ * Fetch catalog information. Since we want to allow partitions to be
+ * added and removed without holding AccessExclusiveLock on the parent
+ * table, it's possible that the catalog contents could be changing under
+ * us. That means that by the time we fetch the partition bound for a
+ * partition returned by find_inheritance_children, it might no longer be
+ * a partition or might even be a partition of some other table.
+ *
+ * To ensure that we get a consistent view of the catalog data, we first
+ * fetch everything we need and then call AcceptInvalidationMessages. If
+ * SharedInvalidMessageCounter advances between the time we start fetching
+ * information and the time AcceptInvalidationMessages() completes, that
+ * means something may have changed under us, so we start over and do it
+ * all again.
+ */
+ for (;;)
{
- oids = palloc(nparts * sizeof(Oid));
- boundspecs = palloc(nparts * sizeof(PartitionBoundSpec *));
+ uint64 inval_count = SharedInvalidMessageCounter;
+
+ /* Get partition oids from pg_inherits */
+ inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
+ nparts = list_length(inhoids);
+
+ if (nparts > 0)
+ {
+ oids = palloc(nparts * sizeof(Oid));
+ boundspecs = palloc(nparts * sizeof(PartitionBoundSpec *));
+ }
+
+ /* Collect bound spec nodes for each partition */
+ i = 0;
+ foreach(cell, inhoids)
+ {
+ Oid inhrelid = lfirst_oid(cell);
+ HeapTuple tuple;
+ PartitionBoundSpec *boundspec = NULL;
+
+ /*
+ * Don't put any sanity checks here that might fail as a result of
+ * concurrent DDL, such as a check that relpartbound is not NULL.
+ * We could transiently see such states as a result of concurrent
+ * DDL. Such checks can be performed only after we're sure we got
+ * a consistent view of the underlying data.
+ */
+ tuple = SearchSysCache1(RELOID, inhrelid);
+ if (HeapTupleIsValid(tuple))
+ {
+ Datum datum;
+ bool isnull;
+
+ datum = SysCacheGetAttr(RELOID, tuple,
+ Anum_pg_class_relpartbound,
+ &isnull);
+ if (!isnull)
+ boundspec = stringToNode(TextDatumGetCString(datum));
+ ReleaseSysCache(tuple);
+ }
+
+ oids[i] = inhrelid;
+ boundspecs[i] = boundspec;
+ ++i;
+ }
+
+ /*
+ * If no relevant catalog changes have occurred (see comments at the
+ * top of this loop), then we got a consistent view of our partition
+ * list and can stop now.
+ */
+ AcceptInvalidationMessages();
+ if (inval_count == SharedInvalidMessageCounter)
+ break;
+
+ /* Something changed, so retry from the top. */
+ if (oids != NULL)
+ {
+ pfree(oids);
+ oids = NULL;
+ }
+ if (boundspecs != NULL)
+ {
+ pfree(boundspecs);
+ boundspecs = NULL;
+ }
+ if (inhoids != NIL)
+ list_free(inhoids);
}
- /* Collect bound spec nodes for each partition */
- i = 0;
- foreach(cell, inhoids)
+ /*
+ * At this point, we should have a consistent view of the data we got from
+ * pg_inherits and pg_class, so it's safe to perform some sanity checks.
+ */
+ for (i = 0; i < nparts; ++i)
{
- Oid inhrelid = lfirst_oid(cell);
- HeapTuple tuple;
- Datum datum;
- bool isnull;
- PartitionBoundSpec *boundspec;
+ Oid inhrelid = oids[i];
+ PartitionBoundSpec *spec = boundspecs[i];
- tuple = SearchSysCache1(RELOID, inhrelid);
- if (!HeapTupleIsValid(tuple))
- elog(ERROR, "cache lookup failed for relation %u", inhrelid);
-
- datum = SysCacheGetAttr(RELOID, tuple,
- Anum_pg_class_relpartbound,
- &isnull);
- if (isnull)
- elog(ERROR, "null relpartbound for relation %u", inhrelid);
- boundspec = stringToNode(TextDatumGetCString(datum));
- if (!IsA(boundspec, PartitionBoundSpec))
+ if (!spec)
+ elog(ERROR, "missing relpartbound for relation %u", inhrelid);
+ if (!IsA(spec, PartitionBoundSpec))
elog(ERROR, "invalid relpartbound for relation %u", inhrelid);
/*
- * Sanity check: If the PartitionBoundSpec says this is the default
- * partition, its OID should correspond to whatever's stored in
- * pg_partitioned_table.partdefid; if not, the catalog is corrupt.
+ * If the PartitionBoundSpec says this is the default partition, its
+ * OID should match pg_partitioned_table.partdefid; if not, the
+ * catalog is corrupt.
*/
- if (boundspec->is_default)
+ if (spec->is_default)
{
Oid partdefid;
@@ -313,11 +383,6 @@ RelationBuildPartitionDesc(Relation rel)
elog(ERROR, "expected partdefid %u, but got %u",
inhrelid, partdefid);
}
-
- oids[i] = inhrelid;
- boundspecs[i] = boundspec;
- ++i;
- ReleaseSysCache(tuple);
}
/* First create PartitionBoundInfo */
--
2.11.0
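As a standalone illustration of the retry pattern used in the 0002 patch above (not PostgreSQL code: the counter and the catalog fetch below are simulated stand-ins for SharedInvalidMessageCounter and the pg_inherits/pg_class scans), the loop can be modeled like this:

```c
#include <assert.h>

/* Hypothetical stand-in for SharedInvalidMessageCounter. */
static unsigned long shared_inval_counter = 0;
static int pending_changes = 2;  /* pretend concurrent DDL commits twice */

/* Simulates scanning the catalogs; a concurrent change bumps the counter. */
static int
fetch_partition_count(void)
{
    if (pending_changes > 0)
    {
        pending_changes--;
        shared_inval_counter++;  /* concurrent DDL arrived mid-scan */
    }
    return 3;
}

/* Retry until one full fetch completes with no counter movement. */
static int
fetch_consistent(int *attempts)
{
    for (;;)
    {
        unsigned long inval_count = shared_inval_counter;
        int         nparts = fetch_partition_count();

        (*attempts)++;
        /* AcceptInvalidationMessages() would run here in PostgreSQL. */
        if (inval_count == shared_inval_counter)
            return nparts;      /* consistent view: stop */
        /* Something changed under us, so retry from the top. */
    }
}
```

With two simulated concurrent changes, the fetch runs three times before the counter holds still and the result is accepted.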
On Thu, Nov 15, 2018 at 01:38:55PM +0900, Amit Langote wrote:
I've fixed 0001 again to re-order the code so that allocations happen in the
correct context and now tests pass with the rebased patches.
I have been looking at 0001, and it seems to me that you make the current
situation even messier. Coming to my point: do we actually have any
need to set rel->rd_pdcxt and rel->rd_partdesc at all if a relation
has no partitions? It seems to me that we had better set rd_pdcxt and
rd_partdesc to NULL in this case.
--
Michael
On 2018/11/15 14:38, Michael Paquier wrote:
On Thu, Nov 15, 2018 at 01:38:55PM +0900, Amit Langote wrote:
I've fixed 0001 again to re-order the code so that allocations happen in the
correct context and now tests pass with the rebased patches.

I have been looking at 0001, and it seems to me that you make the current
situation even messier. Coming to my point: do we actually have any
need to set rel->rd_pdcxt and rel->rd_partdesc at all if a relation
has no partitions? It seems to me that we had better set rd_pdcxt and
rd_partdesc to NULL in this case.
As things stand today, rd_partdesc of a partitioned table must always be
non-NULL. In fact, there are many places in the backend code that Assert it:
tablecmds.c: ATPrepDropNotNull()

    if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
    {
        PartitionDesc partdesc = RelationGetPartitionDesc(rel);

        Assert(partdesc != NULL);

prepunion.c: expand_partitioned_rtentry()

    PartitionDesc partdesc = RelationGetPartitionDesc(parentrel);

    check_stack_depth();

    /* A partitioned table should always have a partition descriptor. */
    Assert(partdesc);

plancat.c: set_relation_partition_info()

    partdesc = RelationGetPartitionDesc(relation);
    partkey = RelationGetPartitionKey(relation);
    rel->part_scheme = find_partition_scheme(root, relation);
    Assert(partdesc != NULL && rel->part_scheme != NULL);
Maybe there are others in a different form.
If there are no partitions, nparts is 0, and other fields are NULL, though
rd_partdesc itself is never NULL.
If we want to redesign that and allow it to be NULL until some code in the
backend wants to use it, then maybe we can consider doing what you say.
But, many non-trivial operations on partitioned tables require the
PartitionDesc, so there is perhaps not much point to designing it such
that rd_partdesc is set only when needed, because it will be referenced
sooner rather than later. Maybe we can consider doing that sort of thing for
boundinfo, because it's expensive to build, and not all operations want
the canonicalized bounds.
Thanks,
Amit
On Thu, Nov 15, 2018 at 02:53:47PM +0900, Amit Langote wrote:
As things stand today, rd_partdesc of a partitioned table must always be
non-NULL. In fact, there are many places in the backend code that Assert it: [...]
I have noticed those, and they would not actually care much if
rd_partdesc was NULL. I find it interesting that the planner
portion does roughly the same thing with a partitioned table
with no partitions and with a non-partitioned table.
Maybe there are others in a different form.
If there are no partitions, nparts is 0, and other fields are NULL, though
rd_partdesc itself is never NULL.
I find it a bit confusing that both concepts have the same meaning, i.e.
that a relation has no partitions, and that it is actually relkind which
decides whether rd_partdesc should be NULL or set up. This stuff also does
empty allocations.
If we want to redesign that and allow it to be NULL until some code in the
backend wants to use it, then maybe we can consider doing what you say.
But, many non-trivial operations on partitioned tables require the
PartitionDesc, so there is perhaps not much point to designing it such
that rd_partdesc is set only when needed, because it will be referenced
sooner than later. Maybe, we can consider doing that sort of thing for
boundinfo, because it's expensive to build, and not all operations want
the canonicalized bounds.
I am fine if that's the consensus of this thread. But as far as I can
see it is possible to remove a bit of the memory handling mess by doing
so. My 2c.
--
Michael
On 2018/11/15 15:22, Michael Paquier wrote:
If there are no partitions, nparts is 0, and other fields are NULL, though
rd_partdesc itself is never NULL.

I find it a bit confusing that both concepts have the same meaning, i.e.
that a relation has no partitions, and that it is actually relkind which
decides whether rd_partdesc should be NULL or set up. This stuff also does
empty allocations.

If we want to redesign that and allow it to be NULL until some code in the
backend wants to use it, then maybe we can consider doing what you say.
But, many non-trivial operations on partitioned tables require the
PartitionDesc, so there is perhaps not much point to designing it such
that rd_partdesc is set only when needed, because it will be referenced
sooner than later. Maybe, we can consider doing that sort of thing for
boundinfo, because it's expensive to build, and not all operations want
the canonicalized bounds.

I am fine if that's the consensus of this thread. But as far as I can
see it is possible to remove a bit of the memory handling mess by doing
so. My 2c.
Perhaps we can discuss this on another thread. I know this thread contains
important points about partition descriptor creation and modification, but
memory context considerations seem like a separate topic. The following
message could be a starting point, because there we were talking about a
design perhaps similar to what you're describing:
/messages/by-id/143ed9a4-6038-76d4-9a55-502035815e68@lab.ntt.co.jp
Also, while I understood Alvaro's and your comment on the other thread
that the memory handling is messy as is, sorry, it's not clear to me why
you say this patch makes it messier. It reduces memory context switches, so
that RelationBuildPartitionDesc roughly looks like this after the patch:
Start with CurrentMemoryContext...
1. read catalogs and make bounddescs and oids arrays
2. partition_bounds_create(...)
3. create and switch to rd_pdcxt
4. create PartitionDesc, copy partdesc->oids and partdesc->boundinfo
5. switch back to the old context
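To make that allocation flow concrete, here is a minimal self-contained C model of steps 2-5. ToyContext, toy_alloc, and ToyBoundInfo are hypothetical byte-counting stand-ins, not the real MemoryContext or PartitionBoundInfo types: the result is built in the "current" context and then copied into the long-lived context, which is why two copies of the bound data end up existing.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a MemoryContext: just counts bytes handed out. */
typedef struct { size_t used; } ToyContext;

static void *
toy_alloc(ToyContext *cxt, size_t size)
{
    cxt->used += size;          /* track how much each context allocated */
    return malloc(size);
}

/* Hypothetical stand-in for PartitionBoundInfoData. */
typedef struct { int ndatums; } ToyBoundInfo;

/* Step 2: build in the "current" context (temporary allocations included). */
static ToyBoundInfo *
toy_bounds_create(ToyContext *current, int ndatums)
{
    ToyBoundInfo *b = toy_alloc(current, sizeof(*b));

    b->ndatums = ndatums;
    return b;
}

/* Step 4: copy the result into the long-lived rd_pdcxt analogue. */
static ToyBoundInfo *
toy_bounds_copy(ToyContext *pdcxt, const ToyBoundInfo *src)
{
    ToyBoundInfo *b = toy_alloc(pdcxt, sizeof(*b));

    memcpy(b, src, sizeof(*b));
    return b;
}
```

After the copy, both contexts have allocated one ToyBoundInfo each, mirroring the double allocation discussed later in the thread.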
Thanks,
Amit
On Thu, Nov 15, 2018 at 12:38 AM Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Nov 15, 2018 at 01:38:55PM +0900, Amit Langote wrote:
I've fixed 0001 again to re-order the code so that allocations happen in the
correct context and now tests pass with the rebased patches.

I have been looking at 0001, and it seems to me that you make the current
situation even messier. Coming to my point: do we actually have any
need to set rel->rd_pdcxt and rel->rd_partdesc at all if a relation
has no partitions? It seems to me that we had better set rd_pdcxt and
rd_partdesc to NULL in this case.
I think that's unrelated to this patch, as Amit also says, but I have
to say that the last few hunks of the rebased version of this patch do
not make a lot of sense to me. This patch is supposed to be reducing
list construction, and the original version did that, but the rebased
version adds a partition_bounds_copy() operation, whereas my version
did not add any expensive operations - it only removed some cost. I
don't see why anything I changed should necessitate such a change, nor
does it seem like a good idea.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2018/11/15 22:57, Robert Haas wrote:
On Thu, Nov 15, 2018 at 12:38 AM Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Nov 15, 2018 at 01:38:55PM +0900, Amit Langote wrote:
I've fixed 0001 again to re-order the code so that allocations happen in the
correct context and now tests pass with the rebased patches.

I have been looking at 0001, and it seems to me that you make the current
situation even messier. Coming to my point: do we actually have any
need to set rel->rd_pdcxt and rel->rd_partdesc at all if a relation
has no partitions? It seems to me that we had better set rd_pdcxt and
rd_partdesc to NULL in this case.

I think that's unrelated to this patch, as Amit also says, but I have
to say that the last few hunks of the rebased version of this patch do
not make a lot of sense to me. This patch is supposed to be reducing
list construction, and the original version did that, but the rebased
version adds a partition_bounds_copy() operation, whereas my version
did not add any expensive operations - it only removed some cost. I
don't see why anything I changed should necessitate such a change, nor
does it seem like a good idea.
The partition_bounds_copy() is not because of your changes; it's there in
HEAD. The reason we do that is that partition_bounds_create() allocates
the memory for the PartitionBoundInfo it returns, along with other
temporary allocations, in CurrentMemoryContext. But we need to copy it
into rd_pdcxt before making it a property of rd_partdesc, hence the
partition_bounds_copy().
Maybe partition_bounds_create() should've had a MemoryContext argument to
pass it the context we want it to create the PartitionBoundInfo in. That
way, we can simply pass rd_pdcxt to it and avoid making a copy. As is,
we're now allocating two copies of PartitionBoundInfo, one in the
CurrentMemoryContext and another in rd_pdcxt, whereas the previous code
would only allocate the latter. Maybe we should fix that, as it's a regression.
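A sketch of what such a signature could look like, using a hypothetical byte-counting ToyContext in place of PostgreSQL's real MemoryContext (the names here are illustrative, not proposed API): the create function allocates its result directly in the caller-supplied context, so no second copy into rd_pdcxt is needed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical byte-counting stand-in for PostgreSQL's MemoryContext. */
typedef struct { size_t used; } ToyContext;

static void *
toy_alloc(ToyContext *cxt, size_t size)
{
    cxt->used += size;          /* track what this context allocated */
    return malloc(size);
}

/* Hypothetical stand-in for PartitionBoundInfoData. */
typedef struct { int ndatums; } ToyBoundInfo;

/*
 * Proposed shape: build the result directly in the caller-supplied
 * context, so the caller can pass rd_pdcxt and skip the extra copy.
 */
static ToyBoundInfo *
toy_bounds_create_in(ToyContext *cxt, int ndatums)
{
    ToyBoundInfo *b = toy_alloc(cxt, sizeof(*b));

    b->ndatums = ndatums;
    return b;
}
```

With this shape only the destination context ever holds a copy, instead of one copy in CurrentMemoryContext plus one in rd_pdcxt.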
Thanks,
Amit
On Fri, Nov 16, 2018 at 10:57:57AM +0900, Amit Langote wrote:
Maybe partition_bounds_create() should've had a MemoryContext argument to
pass it the context we want it to create the PartitionBoundInfo in. That
way, we can simply pass rd_pdcxt to it and avoid making a copy. As is,
we're now allocating two copies of PartitionBoundInfo, one in the
CurrentMemoryContext and another in rd_pdcxt, whereas the previous code
would only allocate the latter. Maybe we should fix it as being a regression.
Not sure what you mean by regression here, but passing the memory
context as an argument makes sense, as you can remove the extra partition
bound copy, and it makes sense to use an array instead of a list for
performance, which may matter when many partitions are handled while
building the cache. So cleaning up both things at the same time would
be nice.
--
Michael
On Fri, Nov 16, 2018 at 1:00 PM Michael Paquier <michael@paquier.xyz> wrote:
On Fri, Nov 16, 2018 at 10:57:57AM +0900, Amit Langote wrote:
Maybe partition_bounds_create() should've had a MemoryContext argument to
pass it the context we want it to create the PartitionBoundInfo in. That
way, we can simply pass rd_pdcxt to it and avoid making a copy. As is,
we're now allocating two copies of PartitionBoundInfo, one in the
CurrentMemoryContext and another in rd_pdcxt, whereas the previous code
would only allocate the latter. Maybe we should fix it as being a regression.

Not sure about what you mean by regression here,
The regression is, as I mentioned, that the new code allocates two
copies of PartitionBoundInfo whereas only one would be allocated
before.
but passing the memory
context as an argument has sense as you can remove the extra partition
bound copy, as it has sense to use an array instead of a list for
performance, which may matter if many partitions are handled when
building the cache. So cleaning up both things at the same time would
be nice.
Maybe the patch to add the memory context argument to
partition_bounds_create and other related static functions in
partbounds.c should be its own patch, as that seems to be a separate
issue. OTOH, other changes needed to implement Robert's proposal of
using PartitionBoundSpec and Oid arrays instead of existing lists
should be in the same patch.
Thanks,
Amit
On Thu, Nov 15, 2018 at 8:58 PM Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
The partition_bounds_copy() is not because of your changes, it's there in
HEAD.
OK, but it seems to me that your version of my patch rearranges the
code more than necessary.
How about the attached?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
v3-0001-Reduce-unnecessary-list-construction-in-RelationB.patch (application/octet-stream)
From 102174204b68c35f8c2c087226ab2a72d5958b6c Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Nov 2018 11:11:58 -0500
Subject: [PATCH v3] Reduce unnecessary list construction in
RelationBuildPartitionDesc.
The 'partoids' list which was constructed by the previous version
of this code was necessarily identical to 'inhoids'. There's no
point to duplicating the list, so avoid that. Instead, construct
the array representation directly from the original 'inhoids' list.
Also, use an array rather than a list for 'boundspecs'. We know
exactly how many items we need to store, so there's really no
reason to use a list. Using an array instead reduces the number
of memory allocations we perform.
---
src/backend/partitioning/partbounds.c | 66 +++++++++++++++--------------------
src/backend/utils/cache/partcache.c | 38 +++++++++++---------
src/include/partitioning/partbounds.h | 5 ++-
3 files changed, 52 insertions(+), 57 deletions(-)
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index be9fd49cd2..eeaab2f4c9 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -70,15 +70,12 @@ static int32 qsort_partition_list_value_cmp(const void *a, const void *b,
void *arg);
static int32 qsort_partition_rbound_cmp(const void *a, const void *b,
void *arg);
-static PartitionBoundInfo create_hash_bounds(List *boundspecs,
- PartitionKey key,
- int **mapping);
-static PartitionBoundInfo create_list_bounds(List *boundspecs,
- PartitionKey key,
- int **mapping);
-static PartitionBoundInfo create_range_bounds(List *boundspecs,
- PartitionKey key,
- int **mapping);
+static PartitionBoundInfo create_hash_bounds(PartitionBoundSpec **boundspecs,
+ int nparts, PartitionKey key, int **mapping);
+static PartitionBoundInfo create_list_bounds(PartitionBoundSpec **boundspecs,
+ int nparts, PartitionKey key, int **mapping);
+static PartitionBoundInfo create_range_bounds(PartitionBoundSpec **boundspecs,
+ int nparts, PartitionKey key, int **mapping);
static PartitionRangeBound *make_one_partition_rbound(PartitionKey key, int index,
List *datums, bool lower);
static int32 partition_hbound_cmp(int modulus1, int remainder1, int modulus2,
@@ -169,9 +166,9 @@ get_qual_from_partbound(Relation rel, Relation parent,
* current memory context.
*/
PartitionBoundInfo
-partition_bounds_create(List *boundspecs, PartitionKey key, int **mapping)
+partition_bounds_create(PartitionBoundSpec **boundspecs, int nparts,
+ PartitionKey key, int **mapping)
{
- int nparts = list_length(boundspecs);
int i;
Assert(nparts > 0);
@@ -199,13 +196,13 @@ partition_bounds_create(List *boundspecs, PartitionKey key, int **mapping)
switch (key->strategy)
{
case PARTITION_STRATEGY_HASH:
- return create_hash_bounds(boundspecs, key, mapping);
+ return create_hash_bounds(boundspecs, nparts, key, mapping);
case PARTITION_STRATEGY_LIST:
- return create_list_bounds(boundspecs, key, mapping);
+ return create_list_bounds(boundspecs, nparts, key, mapping);
case PARTITION_STRATEGY_RANGE:
- return create_range_bounds(boundspecs, key, mapping);
+ return create_range_bounds(boundspecs, nparts, key, mapping);
default:
elog(ERROR, "unexpected partition strategy: %d",
@@ -222,13 +219,12 @@ partition_bounds_create(List *boundspecs, PartitionKey key, int **mapping)
* Create a PartitionBoundInfo for a hash partitioned table
*/
static PartitionBoundInfo
-create_hash_bounds(List *boundspecs, PartitionKey key, int **mapping)
+create_hash_bounds(PartitionBoundSpec **boundspecs, int nparts,
+ PartitionKey key, int **mapping)
{
PartitionBoundInfo boundinfo;
PartitionHashBound **hbounds = NULL;
- ListCell *cell;
- int i,
- nparts = list_length(boundspecs);
+ int i;
int ndatums = 0;
int greatest_modulus;
@@ -244,10 +240,9 @@ create_hash_bounds(List *boundspecs, PartitionKey key, int **mapping)
palloc(nparts * sizeof(PartitionHashBound *));
/* Convert from node to the internal representation */
- i = 0;
- foreach(cell, boundspecs)
+ for (i = 0; i < nparts; i++)
{
- PartitionBoundSpec *spec = castNode(PartitionBoundSpec, lfirst(cell));
+ PartitionBoundSpec *spec = boundspecs[i];
if (spec->strategy != PARTITION_STRATEGY_HASH)
elog(ERROR, "invalid strategy in partition bound spec");
@@ -256,7 +251,6 @@ create_hash_bounds(List *boundspecs, PartitionKey key, int **mapping)
hbounds[i]->modulus = spec->modulus;
hbounds[i]->remainder = spec->remainder;
hbounds[i]->index = i;
- i++;
}
/* Sort all the bounds in ascending order */
@@ -307,7 +301,8 @@ create_hash_bounds(List *boundspecs, PartitionKey key, int **mapping)
* Create a PartitionBoundInfo for a list partitioned table
*/
static PartitionBoundInfo
-create_list_bounds(List *boundspecs, PartitionKey key, int **mapping)
+create_list_bounds(PartitionBoundSpec **boundspecs, int nparts,
+ PartitionKey key, int **mapping)
{
PartitionBoundInfo boundinfo;
PartitionListValue **all_values = NULL;
@@ -327,9 +322,9 @@ create_list_bounds(List *boundspecs, PartitionKey key, int **mapping)
boundinfo->default_index = -1;
/* Create a unified list of non-null values across all partitions. */
- foreach(cell, boundspecs)
+ for (i = 0; i < nparts; i++)
{
- PartitionBoundSpec *spec = castNode(PartitionBoundSpec, lfirst(cell));
+ PartitionBoundSpec *spec = boundspecs[i];
ListCell *c;
if (spec->strategy != PARTITION_STRATEGY_LIST)
@@ -343,7 +338,6 @@ create_list_bounds(List *boundspecs, PartitionKey key, int **mapping)
if (spec->is_default)
{
default_index = i;
- i++;
continue;
}
@@ -374,8 +368,6 @@ create_list_bounds(List *boundspecs, PartitionKey key, int **mapping)
if (list_value)
non_null_values = lappend(non_null_values, list_value);
}
-
- i++;
}
ndatums = list_length(non_null_values);
@@ -458,7 +450,7 @@ create_list_bounds(List *boundspecs, PartitionKey key, int **mapping)
}
/* All partition must now have been assigned canonical indexes. */
- Assert(next_index == list_length(boundspecs));
+ Assert(next_index == nparts);
return boundinfo;
}
@@ -467,16 +459,15 @@ create_list_bounds(List *boundspecs, PartitionKey key, int **mapping)
* Create a PartitionBoundInfo for a range partitioned table
*/
static PartitionBoundInfo
-create_range_bounds(List *boundspecs, PartitionKey key, int **mapping)
+create_range_bounds(PartitionBoundSpec **boundspecs, int nparts,
+ PartitionKey key, int **mapping)
{
PartitionBoundInfo boundinfo;
PartitionRangeBound **rbounds = NULL;
PartitionRangeBound **all_bounds,
*prev;
- ListCell *cell;
int i,
- k,
- nparts = list_length(boundspecs);
+ k;
int ndatums = 0;
int default_index = -1;
int next_index = 0;
@@ -493,10 +484,10 @@ create_range_bounds(List *boundspecs, PartitionKey key, int **mapping)
palloc0(2 * nparts * sizeof(PartitionRangeBound *));
/* Create a unified list of range bounds across all the partitions. */
- i = ndatums = 0;
- foreach(cell, boundspecs)
+ ndatums = 0;
+ for (i = 0; i < nparts; i++)
{
- PartitionBoundSpec *spec = castNode(PartitionBoundSpec, lfirst(cell));
+ PartitionBoundSpec *spec = boundspecs[i];
PartitionRangeBound *lower,
*upper;
@@ -510,7 +501,7 @@ create_range_bounds(List *boundspecs, PartitionKey key, int **mapping)
*/
if (spec->is_default)
{
- default_index = i++;
+ default_index = i;
continue;
}
@@ -518,7 +509,6 @@ create_range_bounds(List *boundspecs, PartitionKey key, int **mapping)
upper = make_one_partition_rbound(key, i, spec->upperdatums, false);
all_bounds[ndatums++] = lower;
all_bounds[ndatums++] = upper;
- i++;
}
Assert(ndatums == nparts * 2 ||
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 07653f312b..a87c460ea2 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -255,28 +255,36 @@ void
RelationBuildPartitionDesc(Relation rel)
{
PartitionDesc partdesc;
- PartitionBoundInfo boundinfo;
+ PartitionBoundInfo boundinfo = NULL;
List *inhoids;
- List *boundspecs = NIL;
+ PartitionBoundSpec **boundspecs = NULL;
+ Oid *oids = NULL;
ListCell *cell;
int i,
nparts;
PartitionKey key = RelationGetPartitionKey(rel);
MemoryContext oldcxt;
- Oid *oids_orig;
int *mapping;
/* Get partition oids from pg_inherits */
inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
+ nparts = list_length(inhoids);
- /* Collect bound spec nodes in a list */
+ if (nparts > 0)
+ {
+ oids = palloc(nparts * sizeof(Oid));
+ boundspecs = palloc(nparts * sizeof(PartitionBoundSpec *));
+ }
+
+ /* Collect bound spec nodes for each partition */
+ i = 0;
foreach(cell, inhoids)
{
Oid inhrelid = lfirst_oid(cell);
HeapTuple tuple;
Datum datum;
bool isnull;
- Node *boundspec;
+ PartitionBoundSpec *boundspec;
tuple = SearchSysCache1(RELOID, inhrelid);
if (!HeapTupleIsValid(tuple))
@@ -287,14 +295,16 @@ RelationBuildPartitionDesc(Relation rel)
&isnull);
if (isnull)
elog(ERROR, "null relpartbound for relation %u", inhrelid);
- boundspec = (Node *) stringToNode(TextDatumGetCString(datum));
+ boundspec = stringToNode(TextDatumGetCString(datum));
+ if (!IsA(boundspec, PartitionBoundSpec))
+ elog(ERROR, "invalid relpartbound for relation %u", inhrelid);
/*
* Sanity check: If the PartitionBoundSpec says this is the default
* partition, its OID should correspond to whatever's stored in
* pg_partitioned_table.partdefid; if not, the catalog is corrupt.
*/
- if (castNode(PartitionBoundSpec, boundspec)->is_default)
+ if (boundspec->is_default)
{
Oid partdefid;
@@ -304,12 +314,12 @@ RelationBuildPartitionDesc(Relation rel)
inhrelid, partdefid);
}
- boundspecs = lappend(boundspecs, boundspec);
+ oids[i] = inhrelid;
+ boundspecs[i] = boundspec;
+ ++i;
ReleaseSysCache(tuple);
}
- nparts = list_length(boundspecs);
-
/* Now build the actual relcache partition descriptor */
rel->rd_pdcxt = AllocSetContextCreate(CacheMemoryContext,
"partition descriptor",
@@ -330,11 +340,7 @@ RelationBuildPartitionDesc(Relation rel)
}
/* First create PartitionBoundInfo */
- boundinfo = partition_bounds_create(boundspecs, key, &mapping);
- oids_orig = (Oid *) palloc(sizeof(Oid) * partdesc->nparts);
- i = 0;
- foreach(cell, inhoids)
- oids_orig[i++] = lfirst_oid(cell);
+ boundinfo = partition_bounds_create(boundspecs, nparts, key, &mapping);
/* Now copy boundinfo and oids into partdesc. */
oldcxt = MemoryContextSwitchTo(rel->rd_pdcxt);
@@ -348,7 +354,7 @@ RelationBuildPartitionDesc(Relation rel)
* canonicalized representation of the partition bounds.
*/
for (i = 0; i < partdesc->nparts; i++)
- partdesc->oids[mapping[i]] = oids_orig[i];
+ partdesc->oids[mapping[i]] = oids[i];
MemoryContextSwitchTo(oldcxt);
rel->rd_partdesc = partdesc;
diff --git a/src/include/partitioning/partbounds.h b/src/include/partitioning/partbounds.h
index 7a697d1c0a..36fb584e23 100644
--- a/src/include/partitioning/partbounds.h
+++ b/src/include/partitioning/partbounds.h
@@ -80,9 +80,8 @@ extern uint64 compute_partition_hash_value(int partnatts, FmgrInfo *partsupfunc,
Datum *values, bool *isnull);
extern List *get_qual_from_partbound(Relation rel, Relation parent,
PartitionBoundSpec *spec);
-extern PartitionBoundInfo partition_bounds_create(List *boundspecs,
- PartitionKey key,
- int **mapping);
+extern PartitionBoundInfo partition_bounds_create(PartitionBoundSpec **boundspecs,
+ int nparts, PartitionKey key, int **mapping);
extern bool partition_bounds_equal(int partnatts, int16 *parttyplen,
bool *parttypbyval, PartitionBoundInfo b1,
PartitionBoundInfo b2);
--
2.14.3 (Apple Git-98)
On Fri, Nov 16, 2018 at 09:38:40AM -0500, Robert Haas wrote:
OK, but it seems to me that your version of my patch rearranges the
code more than necessary.

How about the attached?
What you are proposing here looks good to me. Thanks!
--
Michael
On 2018/11/17 9:06, Michael Paquier wrote:
On Fri, Nov 16, 2018 at 09:38:40AM -0500, Robert Haas wrote:
OK, but it seems to me that your version of my patch rearranges the
code more than necessary.

How about the attached?
What you are proposing here looks good to me. Thanks!
Me too, now that I see the patch closely. The errors I'd seen in the
regression tests were due to an uninitialized oids variable, which is fixed in
the later patches, not due to "confused memory context switching" as I'd
put it [1] and made that the reason for additional changes.
Thanks,
Amit
[1]: /messages/by-id/1be8055c-137b-5639-9bcf-8a2d5fef6e5a@lab.ntt.co.jp
On Sun, Nov 18, 2018 at 9:43 PM Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2018/11/17 9:06, Michael Paquier wrote:
On Fri, Nov 16, 2018 at 09:38:40AM -0500, Robert Haas wrote:
OK, but it seems to me that your version of my patch rearranges the
code more than necessary.

How about the attached?
What you are proposing here looks good to me. Thanks!
Me too, now that I see the patch closely. The errors I'd seen in the
regression tests were due to uninitialized oids variable which is fixed in
the later patches, not due to "confused memory context switching" as I'd
put it [1] and made that the reason for additional changes.
OK. Rebased again, and committed (although I forgot to include a link
to this discussion - sorry about that).
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Nov 14, 2018 at 9:03 PM Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2018/11/15 4:27, Robert Haas wrote:
RelationBuildPartitionDesc doesn't lock the children
whose relpartbounds it is fetching (!), so unless we're guaranteed to
have already locked the children earlier for some other reason, we
could grab the partition bound at this point and then it could change
again before we get a lock on them.

Hmm, I think that RelationBuildPartitionDesc doesn't need to lock a
partition before fetching its relpartbound, because the latter can't
change if the caller is holding a lock on the parent, which it must be if
we're in RelationBuildPartitionDesc for parent at all. Am I missing
something?
After thinking about this for a bit, I think that right now it's fine,
because you can't create or drop or attach or detach a partition
without holding AccessExclusiveLock on both the parent and the child,
so if you hold even AccessShareLock on the parent, the child's
relpartbound can't be changing. However, what we want to do is get
the lock level on the parent down to ShareUpdateExclusiveLock, at
which point the child's relpartbound could indeed change under us. I
think, however, that what I previously posted as 0002 is sufficient to
fix that part of the problem.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello
OK. Rebased again, and committed (although I forgot to include a link
to this discussion - sorry about that).
Seems we erroneously moved this thread to next CF: https://commitfest.postgresql.org/21/1842/
Can you close this entry?
regards, Sergei
On Sat, Dec 15, 2018 at 01:04:00PM +0300, Sergei Kornilov wrote:
Seems we erroneously moved this thread to next CF:
https://commitfest.postgresql.org/21/1842/
Can you close this entry?
Robert has committed a patch to refactor a bit the list construction of
RelationBuildPartitionDesc thanks to 7ee5f88e, but the main patch has
not been committed, so the current status looks right to me.
--
Michael
On Sun, Dec 16, 2018 at 6:43 AM Michael Paquier <michael@paquier.xyz> wrote:
On Sat, Dec 15, 2018 at 01:04:00PM +0300, Sergei Kornilov wrote:
Seems we erroneously moved this thread to next CF:
https://commitfest.postgresql.org/21/1842/
Can you close this entry?
Robert has committed a patch to refactor a bit the list construction of
RelationBuildPartitionDesc thanks to 7ee5f88e, but the main patch has
not been committed, so the current status looks right to me.
I have done a bit more work on this, but need to spend some more time
on it before I have something that is worth posting. Not sure whether
I'll get to that before the New Year at this point.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2018-Dec-17, Robert Haas wrote:
I have done a bit more work on this, but need to spend some more time
on it before I have something that is worth posting. Not sure whether
I'll get to that before the New Year at this point.
This patch missing the CF deadline would not be a happy way for me to
begin the new year.
I'm not sure what's the best way to move forward with this patch, but I
encourage you to post whatever version you have before the deadline,
even if you're not fully happy with it (and, heck, even if it doesn't
compile and/or is full of FIXME or TODO comments).
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Dec 17, 2018 at 06:52:51PM -0300, Alvaro Herrera wrote:
On 2018-Dec-17, Robert Haas wrote:
This patch missing the CF deadline would not be a happy way for me to
begin the new year.
I'm not sure what's the best way to move forward with this patch, but I
encourage you to post whatever version you have before the deadline,
even if you're not fully happy with it (and, heck, even if it doesn't
compile and/or is full of FIXME or TODO comments).
Agreed. This patch has value, and somebody else could always take it
from the point where you were.
--
Michael
On Mon, Dec 17, 2018 at 6:44 PM Michael Paquier <michael@paquier.xyz> wrote:
Agreed. This patch has value, and somebody else could always take it
from the point where you were.
OK. I'll post what I have by the end of the week.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Dec 18, 2018 at 01:41:06PM -0500, Robert Haas wrote:
OK. I'll post what I have by the end of the week.
Thanks, Robert.
--
Michael
On Tue, Dec 18, 2018 at 8:04 PM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Dec 18, 2018 at 01:41:06PM -0500, Robert Haas wrote:
OK. I'll post what I have by the end of the week.
Thanks, Robert.
OK, so I got slightly delayed here by utterly destroying my laptop,
but I've mostly reconstructed what I did. I think there are some
remaining problems, but this seems like a good time to share what I've
got so far. I'm attaching three patches.
0001 is one which I posted before. It attempts to fix up
RelationBuildPartitionDesc() so that this function will always return
a partition descriptor based on a consistent snapshot of the catalogs.
Without this, I think there's nothing to prevent a commit which
happens while the function is running from causing the function to
fail or produce nonsense answers.
0002 introduces the concept of a partition directory. The idea is
that the planner will create a partition directory, and so will the
executor, and all calls which occur in those places to
RelationGetPartitionDesc() will instead call
PartitionDirectoryLookup(). Every lookup for the same relation in the
same partition directory is guaranteed to produce the same answer. I
believe this patch still has a number of weaknesses. More on that
below.
0003 actually lowers the lock level. The comment here might need some
more work.
Here is a list of possible or definite problems that are known to me:
- I think we need a way to make partition directory lookups consistent
across backends in the case of parallel query. I believe this can be
done with a dshash and some serialization and deserialization logic,
but I haven't attempted that yet.
- I refactored expand_inherited_rtentry() to drive partition expansion
entirely off of PartitionDescs. The reason why this is necessary is
that it clearly will not work to have find_all_inheritors() use a
current snapshot to decide what children we have and lock them, and
then consult a different source of truth to decide which relations to
open with NoLock. There's nothing to keep the lists of partitions
from being different in the two cases, and that demonstrably causes
assertion failures if you SELECT with an ATTACH/DETACH loop running in
the background. However, it also changes the order in which tables get
locked. Possibly that could be fixed by teaching
expand_partitioned_rtentry() to qsort() the OIDs the way
find_inheritance_children() does. It also loses the infinite-loop
protection which find_all_inheritors() has. Not sure what to do about
that.
- In order for the new PartitionDirectory machinery to avoid
use-after-free bugs, we have to either copy the PartitionDesc out of
the relcache into the partition directory or avoid freeing it while it
is still in use. Copying it seems unappealing for performance
reasons, so I took the latter approach. However, what I did here in
terms of reclaiming memory is just about the least aggressive strategy
short of leaking it altogether - it just keeps it around until the
next rebuild that occurs while the relcache entry is not in use. We
might want to do better, e.g. freeing old copies immediately as soon
as the relcache reference count drops to 0. I just did it this way
because it was simple to code.
- I tried this with Alvaro's isolation tests and it fails some tests.
That's because Alvaro's tests expect that the list of accessible
partitions is based on the query snapshot. For reasons I attempted to
explain in the comments in 0003, I think the idea that we can choose
the set of accessible partitions based on the query snapshot is very
wrong. For example, suppose transaction 1 begins, reads an unrelated
table to establish a snapshot, and then goes idle. Then transaction 2
comes along, detaches a partition from an important table, and then
does crazy stuff with that table -- changes the column list, drops it,
reattaches it with different bounds, whatever. Then it commits. If
transaction 1 now comes along and uses the query snapshot to decide
that the detached partition ought to still be seen as a partition of
that partitioned table, disaster will ensue.
- I don't have any tests here, but I think it would be good to add
some, perhaps modeled on Alvaro's, and also some that involve multiple
ATTACH and DETACH operations mixed with other interesting kinds of
DDL. I also didn't make any attempt to update the documentation,
which is another thing that will probably have to be done at some
point. Not sure how much documentation we have of any of this now.
- I am uncertain whether the logic that builds up the partition
constraint is really safe in the face of concurrent DDL. I kinda
suspect there are some problems there, but maybe not. Once you hold
ANY lock on a partition, the partition constraint can't concurrently
become stricter, because no ATTACH can be done without
AccessExclusiveLock on the partition being attached; but it could
concurrently become less strict, because you only need a lesser lock
for a detach. Not sure if/how that could foul up this logic.
- I have not done anything about the fact that index detach takes
AccessExclusiveLock.
Thoughts?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
0001-Ensure-that-RelationBuildPartitionDesc-sees-a-consis.patch
From 91a62cf6ed14ca4e644a10c7fe7dd11fe7f19c2e Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 20 Dec 2018 12:37:24 -0500
Subject: [PATCH 1/3] Ensure that RelationBuildPartitionDesc sees a consistent
view.
If partitions are added or removed concurrently, make sure that we
nevertheless get a view of the partition list and the partition
descriptor for each partition which is consistent with the system
state at some single point in the commit history.
To do this, reuse an idea first invented by Noah Misch back in
commit 4240e429d0c2d889d0cda23c618f94e12c13ade7.
---
src/backend/utils/cache/partcache.c | 137 ++++++++++++++++++++++++++----------
1 file changed, 101 insertions(+), 36 deletions(-)
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 6db2c6f783..f2bb4bbeb5 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -28,8 +28,10 @@
#include "optimizer/clauses.h"
#include "optimizer/planner.h"
#include "partitioning/partbounds.h"
+#include "storage/sinval.h"
#include "utils/builtins.h"
#include "utils/datum.h"
+#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/partcache.h"
@@ -266,45 +268,113 @@ RelationBuildPartitionDesc(Relation rel)
MemoryContext oldcxt;
int *mapping;
- /* Get partition oids from pg_inherits */
- inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
- nparts = list_length(inhoids);
-
- if (nparts > 0)
+ /*
+ * Fetch catalog information. Since we want to allow partitions to be
+ * added and removed without holding AccessExclusiveLock on the parent
+ * table, it's possible that the catalog contents could be changing under
+ * us. That means that by the time we fetch the partition bound for a
+ * partition returned by find_inheritance_children, it might no longer be
+ * a partition or might even be a partition of some other table.
+ *
+ * To ensure that we get a consistent view of the catalog data, we first
+ * fetch everything we need and then call AcceptInvalidationMessages. If
+ * SharedInvalidMessageCounter advances between the time we start fetching
+ * information and the time AcceptInvalidationMessages() completes, that
+ * means something may have changed under us, so we start over and do it
+ * all again.
+ */
+ for (;;)
{
- oids = palloc(nparts * sizeof(Oid));
- boundspecs = palloc(nparts * sizeof(PartitionBoundSpec *));
+ uint64 inval_count = SharedInvalidMessageCounter;
+
+ /* Get partition oids from pg_inherits */
+ inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
+ nparts = list_length(inhoids);
+
+ if (nparts > 0)
+ {
+ oids = palloc(nparts * sizeof(Oid));
+ boundspecs = palloc(nparts * sizeof(PartitionBoundSpec *));
+ }
+
+ /* Collect bound spec nodes for each partition */
+ i = 0;
+ foreach(cell, inhoids)
+ {
+ Oid inhrelid = lfirst_oid(cell);
+ HeapTuple tuple;
+ PartitionBoundSpec *boundspec = NULL;
+
+ /*
+ * Don't put any sanity checks here that might fail as a result of
+ * concurrent DDL, such as a check that relpartbound is not NULL.
+ * We could transiently see such states as a result of concurrent
+ * DDL. Such checks can be performed only after we're sure we got
+ * a consistent view of the underlying data.
+ */
+ tuple = SearchSysCache1(RELOID, inhrelid);
+ if (HeapTupleIsValid(tuple))
+ {
+ Datum datum;
+ bool isnull;
+
+ datum = SysCacheGetAttr(RELOID, tuple,
+ Anum_pg_class_relpartbound,
+ &isnull);
+ if (!isnull)
+ boundspec = stringToNode(TextDatumGetCString(datum));
+ ReleaseSysCache(tuple);
+ }
+
+ oids[i] = inhrelid;
+ boundspecs[i] = boundspec;
+ ++i;
+ }
+
+ /*
+ * If no relevant catalog changes have occurred (see comments at the
+ * top of this loop), then we got a consistent view of our partition
+ * list and can stop now.
+ */
+ AcceptInvalidationMessages();
+ if (inval_count == SharedInvalidMessageCounter)
+ break;
+
+ /* Something changed, so retry from the top. */
+ if (oids != NULL)
+ {
+ pfree(oids);
+ oids = NULL;
+ }
+ if (boundspecs != NULL)
+ {
+ pfree(boundspecs);
+ boundspecs = NULL;
+ }
+ if (inhoids != NIL)
+ list_free(inhoids);
}
- /* Collect bound spec nodes for each partition */
- i = 0;
- foreach(cell, inhoids)
+ /*
+ * At this point, we should have a consistent view of the data we got from
+ * pg_inherits and pg_class, so it's safe to perform some sanity checks.
+ */
+ for (i = 0; i < nparts; ++i)
{
- Oid inhrelid = lfirst_oid(cell);
- HeapTuple tuple;
- Datum datum;
- bool isnull;
- PartitionBoundSpec *boundspec;
-
- tuple = SearchSysCache1(RELOID, inhrelid);
- if (!HeapTupleIsValid(tuple))
- elog(ERROR, "cache lookup failed for relation %u", inhrelid);
-
- datum = SysCacheGetAttr(RELOID, tuple,
- Anum_pg_class_relpartbound,
- &isnull);
- if (isnull)
- elog(ERROR, "null relpartbound for relation %u", inhrelid);
- boundspec = stringToNode(TextDatumGetCString(datum));
- if (!IsA(boundspec, PartitionBoundSpec))
+ Oid inhrelid = oids[i];
+ PartitionBoundSpec *spec = boundspecs[i];
+
+ if (!spec)
+ elog(ERROR, "missing relpartbound for relation %u", inhrelid);
+ if (!IsA(spec, PartitionBoundSpec))
elog(ERROR, "invalid relpartbound for relation %u", inhrelid);
/*
- * Sanity check: If the PartitionBoundSpec says this is the default
- * partition, its OID should correspond to whatever's stored in
- * pg_partitioned_table.partdefid; if not, the catalog is corrupt.
+ * If the PartitionBoundSpec says this is the default partition, its
+ * OID should match pg_partitioned_table.partdefid; if not, the
+ * catalog is corrupt.
*/
- if (boundspec->is_default)
+ if (spec->is_default)
{
Oid partdefid;
@@ -313,11 +383,6 @@ RelationBuildPartitionDesc(Relation rel)
elog(ERROR, "expected partdefid %u, but got %u",
inhrelid, partdefid);
}
-
- oids[i] = inhrelid;
- boundspecs[i] = boundspec;
- ++i;
- ReleaseSysCache(tuple);
}
/* Now build the actual relcache partition descriptor */
--
2.14.3 (Apple Git-98)
0002-Introduce-the-concept-of-a-partition-directory.patch
From b621382a1907f0aa5db5b81fc3be71e2027c3056 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 18 Dec 2018 18:55:37 -0500
Subject: [PATCH 2/3] Introduce the concept of a partition directory.
Teach the optimizer and executor to use it, so that a single planning
cycle or query execution gets the same PartitionDesc for the same table
every time it looks it up. This does not prevent changes between
planning and execution, nor does it guarantee that all tables are
expanded according to the same snapshot.
---
src/backend/commands/copy.c | 2 +-
src/backend/executor/execPartition.c | 32 +++++++++----
src/backend/executor/nodeModifyTable.c | 2 +-
src/backend/optimizer/prep/prepunion.c | 88 +++++++++++++++-------------------
src/backend/optimizer/util/plancat.c | 6 ++-
src/backend/partitioning/Makefile | 2 +-
src/backend/partitioning/partdir.c | 76 +++++++++++++++++++++++++++++
src/backend/utils/cache/relcache.c | 24 ++++++++++
src/include/executor/execPartition.h | 4 +-
src/include/nodes/execnodes.h | 4 ++
src/include/nodes/relation.h | 4 ++
src/include/partitioning/partdefs.h | 2 +
src/include/partitioning/partdir.h | 21 ++++++++
13 files changed, 202 insertions(+), 65 deletions(-)
create mode 100644 src/backend/partitioning/partdir.c
create mode 100644 src/include/partitioning/partdir.h
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 4311e16007..1b69e3c700 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2528,7 +2528,7 @@ CopyFrom(CopyState cstate)
* CopyFrom tuple routing.
*/
if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
- proute = ExecSetupPartitionTupleRouting(NULL, cstate->rel);
+ proute = ExecSetupPartitionTupleRouting(estate, NULL, cstate->rel);
/*
* It's more efficient to prepare a bunch of tuples for insertion, and
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 179a501f30..f10e6fb95c 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -23,6 +23,7 @@
#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "partitioning/partbounds.h"
+#include "partitioning/partdir.h"
#include "partitioning/partprune.h"
#include "rewrite/rewriteManip.h"
#include "utils/lsyscache.h"
@@ -165,8 +166,10 @@ static void ExecInitRoutingInfo(ModifyTableState *mtstate,
PartitionDispatch dispatch,
ResultRelInfo *partRelInfo,
int partidx);
-static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
- Oid partoid, PartitionDispatch parent_pd, int partidx);
+static PartitionDispatch ExecInitPartitionDispatchInfo(EState *estate,
+ PartitionTupleRouting *proute,
+ Oid partoid, PartitionDispatch parent_pd,
+ int partidx);
static void FormPartitionKeyDatum(PartitionDispatch pd,
TupleTableSlot *slot,
EState *estate,
@@ -202,7 +205,7 @@ static void find_matching_subplans_recurse(PartitionPruningData *prunedata,
* it should be estate->es_query_cxt.
*/
PartitionTupleRouting *
-ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
+ExecSetupPartitionTupleRouting(EState *estate, ModifyTableState *mtstate, Relation rel)
{
PartitionTupleRouting *proute;
ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
@@ -227,7 +230,8 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
* parent as NULL as we don't need to care about any parent of the target
* partitioned table.
*/
- ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL, 0);
+ ExecInitPartitionDispatchInfo(estate, proute, RelationGetRelid(rel),
+ NULL, 0);
/*
* If performing an UPDATE with tuple routing, we can reuse partition
@@ -428,7 +432,7 @@ ExecFindPartition(ModifyTableState *mtstate,
* Create the new PartitionDispatch. We pass the current one
* in as the parent PartitionDispatch
*/
- subdispatch = ExecInitPartitionDispatchInfo(proute,
+ subdispatch = ExecInitPartitionDispatchInfo(estate, proute,
partdesc->oids[partidx],
dispatch, partidx);
Assert(dispatch->indexes[partidx] >= 0 &&
@@ -970,8 +974,9 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
* newly created PartitionDispatch later.
*/
static PartitionDispatch
-ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
- PartitionDispatch parent_pd, int partidx)
+ExecInitPartitionDispatchInfo(EState *estate, PartitionTupleRouting *proute,
+ Oid partoid, PartitionDispatch parent_pd,
+ int partidx)
{
Relation rel;
PartitionDesc partdesc;
@@ -985,7 +990,12 @@ ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
rel = heap_open(partoid, NoLock);
else
rel = proute->partition_root;
- partdesc = RelationGetPartitionDesc(rel);
+
+ if (estate->es_partition_directory == NULL)
+ estate->es_partition_directory =
+ CreatePartitionDirectory(estate->es_query_cxt);
+ partdesc = PartitionDirectoryLookup(estate->es_partition_directory,
+ rel);
pd = (PartitionDispatch) palloc(offsetof(PartitionDispatchData, indexes) +
partdesc->nparts * sizeof(int));
@@ -1548,6 +1558,10 @@ ExecCreatePartitionPruneState(PlanState *planstate,
prunestate->do_exec_prune = false; /* may be set below */
prunestate->num_partprunedata = n_part_hierarchies;
+ if (estate->es_partition_directory == NULL)
+ estate->es_partition_directory =
+ CreatePartitionDirectory(estate->es_query_cxt);
+
/*
* Create a short-term memory context which we'll use when making calls to
* the partition pruning functions. This avoids possible memory leaks,
@@ -1610,7 +1624,7 @@ ExecCreatePartitionPruneState(PlanState *planstate,
*/
partrel = ExecGetRangeTableRelation(estate, pinfo->rtindex);
partkey = RelationGetPartitionKey(partrel);
- partdesc = RelationGetPartitionDesc(partrel);
+ partdesc = PartitionDirectoryLookup(estate->es_partition_directory, partrel);
n_steps = list_length(pinfo->pruning_steps);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 3c60bbcd9c..91ffad26f0 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -2229,7 +2229,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
(operation == CMD_INSERT || update_tuple_routing_needed))
mtstate->mt_partition_tuple_routing =
- ExecSetupPartitionTupleRouting(mtstate, rel);
+ ExecSetupPartitionTupleRouting(estate, mtstate, rel);
/*
* Build state for collecting transition tuples. This requires having a
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index da278f785e..6060302e51 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -48,6 +48,7 @@
#include "optimizer/tlist.h"
#include "parser/parse_coerce.h"
#include "parser/parsetree.h"
+#include "partitioning/partdir.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
#include "utils/selfuncs.h"
@@ -104,8 +105,7 @@ static void expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte,
static void expand_partitioned_rtentry(PlannerInfo *root,
RangeTblEntry *parentrte,
Index parentRTindex, Relation parentrel,
- PlanRowMark *top_parentrc, LOCKMODE lockmode,
- List **appinfos);
+ PlanRowMark *top_parentrc, List **appinfos);
static void expand_single_inheritance_child(PlannerInfo *root,
RangeTblEntry *parentrte,
Index parentRTindex, Relation parentrel,
@@ -1518,7 +1518,6 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
Oid parentOID;
PlanRowMark *oldrc;
Relation oldrelation;
- LOCKMODE lockmode;
List *inhOIDs;
ListCell *l;
@@ -1541,37 +1540,13 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
}
/*
- * The rewriter should already have obtained an appropriate lock on each
- * relation named in the query. However, for each child relation we add
- * to the query, we must obtain an appropriate lock, because this will be
- * the first use of those relations in the parse/rewrite/plan pipeline.
- * Child rels should use the same lockmode as their parent.
- */
- lockmode = rte->rellockmode;
-
- /* Scan for all members of inheritance set, acquire needed locks */
- inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);
-
- /*
- * Check that there's at least one descendant, else treat as no-child
- * case. This could happen despite above has_subclass() check, if table
- * once had a child but no longer does.
- */
- if (list_length(inhOIDs) < 2)
- {
- /* Clear flag before returning */
- rte->inh = false;
- return;
- }
-
- /*
- * If parent relation is selected FOR UPDATE/SHARE, we need to mark its
- * PlanRowMark as isParent = true, and generate a new PlanRowMark for each
- * child.
+ * If parent relation is selected FOR UPDATE/SHARE, we will need to mark
+ * its PlanRowMark as isParent = true, and generate a new PlanRowMark for
+ * each child. expand_single_inheritance_child() will handle this, but we
+ * need to pass down the rowmark for the original parent to make it
+ * possible.
*/
oldrc = get_plan_rowmark(root->rowMarks, rti);
- if (oldrc)
- oldrc->isParent = true;
/*
* Must open the parent relation to examine its tupdesc. We need not lock
@@ -1580,9 +1555,11 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
oldrelation = heap_open(parentOID, NoLock);
/* Scan the inheritance set and expand it */
- if (RelationGetPartitionDesc(oldrelation) != NULL)
+ if (rte->relkind == RELKIND_PARTITIONED_TABLE)
{
- Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);
+ /* Create a partition directory unless already done. */
+ if (root->partition_directory == NULL)
+ root->partition_directory = CreatePartitionDirectory(CurrentMemoryContext);
/*
* If this table has partitions, recursively expand them in the order
@@ -1590,7 +1567,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
* extract the partition key columns of all the partitioned tables.
*/
expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
- lockmode, &root->append_rel_list);
+ &root->append_rel_list);
}
else
{
@@ -1598,9 +1575,12 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
RangeTblEntry *childrte;
Index childRTindex;
+ /* Scan for all members of inheritance set, acquire needed locks */
+ inhOIDs = find_all_inheritors(parentOID, rte->rellockmode, NULL);
+
/*
- * This table has no partitions. Expand any plain inheritance
- * children in the order the OIDs were returned by
+ * This is not a partitioned table, but it may have plain inheritance
+ * children. Expand them in the order that the OIDs were returned by
* find_all_inheritors.
*/
foreach(l, inhOIDs)
@@ -1622,7 +1602,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
*/
if (childOID != parentOID && RELATION_IS_OTHER_TEMP(newrelation))
{
- heap_close(newrelation, lockmode);
+ heap_close(newrelation, rte->rellockmode);
continue;
}
@@ -1637,11 +1617,11 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
}
/*
- * If all the children were temp tables, pretend it's a
- * non-inheritance situation; we don't need Append node in that case.
- * The duplicate RTE we added for the parent table is harmless, so we
- * don't bother to get rid of it; ditto for the useless PlanRowMark
- * node.
+ * If all the children were temp tables, or there were none, pretend
+ * it's a non-inheritance situation; we don't need Append node in that
+ * case. The duplicate RTE we added for the parent table is harmless,
+ * so we don't bother to get rid of it; ditto for the useless
+ * PlanRowMark node.
*/
if (list_length(appinfos) < 2)
rte->inh = false;
@@ -1661,16 +1641,17 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
static void
expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
Index parentRTindex, Relation parentrel,
- PlanRowMark *top_parentrc, LOCKMODE lockmode,
- List **appinfos)
+ PlanRowMark *top_parentrc, List **appinfos)
{
int i;
RangeTblEntry *childrte;
Index childRTindex;
- PartitionDesc partdesc = RelationGetPartitionDesc(parentrel);
+ PartitionDesc partdesc;
check_stack_depth();
+ partdesc = PartitionDirectoryLookup(root->partition_directory, parentrel);
+
/* A partitioned table should always have a partition descriptor. */
Assert(partdesc);
@@ -1707,8 +1688,8 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
Oid childOID = partdesc->oids[i];
Relation childrel;
- /* Open rel; we already have required locks */
- childrel = heap_open(childOID, NoLock);
+ /* Open and lock child rel */
+ childrel = heap_open(childOID, parentrte->rellockmode);
/*
* Temporary partitions belonging to other sessions should have been
@@ -1725,8 +1706,7 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
/* If this child is itself partitioned, recurse */
if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
expand_partitioned_rtentry(root, childrte, childRTindex,
- childrel, top_parentrc, lockmode,
- appinfos);
+ childrel, top_parentrc, appinfos);
/* Close child relation, but keep locks */
heap_close(childrel, NoLock);
@@ -1862,6 +1842,14 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
/* Include child's rowmark type in top parent's allMarkTypes */
top_parentrc->allMarkTypes |= childrc->allMarkTypes;
+ /*
+ * If we create at least one child rowmark, isParent should be set
+ * on the original rowmark. That's very cheap, so just do it here
+ * unconditionally without worrying about whether it has been done
+ * previously.
+ */
+ top_parentrc->isParent = true;
+
root->rowMarks = lappend(root->rowMarks, childrc);
}
}
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index a570ac0aab..f0e5ef070f 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -39,6 +39,7 @@
#include "optimizer/predtest.h"
#include "optimizer/prep.h"
#include "partitioning/partbounds.h"
+#include "partitioning/partdir.h"
#include "parser/parse_relation.h"
#include "parser/parsetree.h"
#include "rewrite/rewriteManip.h"
@@ -1903,7 +1904,10 @@ set_relation_partition_info(PlannerInfo *root, RelOptInfo *rel,
Assert(relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
- partdesc = RelationGetPartitionDesc(relation);
+ /* shouldn't reach here unless expand_inherited_rtentry initialized this */
+ Assert(root->partition_directory != NULL);
+
+ partdesc = PartitionDirectoryLookup(root->partition_directory, relation);
partkey = RelationGetPartitionKey(relation);
rel->part_scheme = find_partition_scheme(root, relation);
Assert(partdesc != NULL && rel->part_scheme != NULL);
diff --git a/src/backend/partitioning/Makefile b/src/backend/partitioning/Makefile
index 278fac3afa..a096b0a0bb 100644
--- a/src/backend/partitioning/Makefile
+++ b/src/backend/partitioning/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/partitioning
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = partprune.o partbounds.o
+OBJS = partprune.o partbounds.o partdir.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/partitioning/partdir.c b/src/backend/partitioning/partdir.c
new file mode 100644
index 0000000000..463d192f13
--- /dev/null
+++ b/src/backend/partitioning/partdir.c
@@ -0,0 +1,76 @@
+/*-------------------------------------------------------------------------
+ *
+ * partdir.c
+ * Support for partition directories
+ *
+ * Partition directories provide a mechanism for looking up the
+ * PartitionDesc for a relation in such a way that the answer will be
+ * the same every time the directory is interrogated, even in the face
+ * of concurrent DDL.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/partitioning/partdir.c
+ *
+ *-------------------------------------------------------------------------
+*/
+#include "postgres.h"
+
+#include "catalog/pg_class.h"
+#include "partitioning/partdir.h"
+#include "utils/hsearch.h"
+#include "utils/rel.h"
+
+typedef struct PartitionDirectoryData
+{
+ MemoryContext pdir_mcxt;
+ HTAB *pdir_htab;
+} PartitionDirectoryData;
+
+typedef struct PartitionDirectoryEntry
+{
+ Oid relid;
+ PartitionDesc pd;
+} PartitionDirectoryEntry;
+
+PartitionDirectory
+CreatePartitionDirectory(MemoryContext mcxt)
+{
+ HASHCTL hctl;
+ MemoryContext oldcontext;
+ PartitionDirectory pdir;
+
+ hctl.keysize = sizeof(Oid);
+ hctl.entrysize = sizeof(PartitionDirectoryEntry);
+ hctl.hcxt = mcxt;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ pdir = palloc(sizeof(PartitionDirectoryData));
+ pdir->pdir_mcxt = mcxt;
+ pdir->pdir_htab = hash_create("PartitionDirectory", 256, &hctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+ MemoryContextSwitchTo(oldcontext);
+ return pdir;
+}
+
+PartitionDesc
+PartitionDirectoryLookup(PartitionDirectory pdir, Relation rel)
+{
+ PartitionDirectoryEntry *pde;
+ Oid relid;
+ bool found;
+
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+ relid = RelationGetRelid(rel);
+ pde = hash_search(pdir->pdir_htab, &relid, HASH_ENTER, &found);
+ if (!found)
+ {
+ pde->pd = RelationGetPartitionDesc(rel);
+ Assert(pde->pd != NULL);
+ }
+ return pde->pd;
+}
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index c3071db1cd..5ec20767de 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2537,6 +2537,30 @@ RelationClearRelation(Relation relation, bool rebuild)
SWAPFIELD(PartitionDesc, rd_partdesc);
SWAPFIELD(MemoryContext, rd_pdcxt);
}
+ else if (rebuild && newrel->rd_partdesc != NULL)
+ {
+ /*
+ * If this is a rebuild, that means that the reference count of this
+ * relation is greater than 0, which means somebody is using it. We want
+ * to allow for the possibility that they might still have a pointer to the
+ * old PartitionDesc, so we don't free it here. Instead, we reparent its
+ * context under the context for the newly-built PartitionDesc, so that it
+ * will get freed when that context is eventually destroyed. While this
+ * doesn't leak memory permanently, there's no upper limit to how long the
+ * old PartitionDesc could stick around, so we might want to consider a
+ * more clever strategy here at some point. Note also that this strategy
+ * relies on the fact that a relation which has a partition descriptor
+ * will never cease having one after a rebuild, which is currently true
+ * even if the table ends up with no partitions.
+ *
+ * NB: At this point in the code, the contents of 'relation' and 'newrel'
+ * have been swapped and then partially unswapped, so, confusingly, it is
+ * 'newrel' that points to the old data.
+ */
+ MemoryContextSetParent(newrel->rd_pdcxt, relation->rd_pdcxt);
+ newrel->rd_pdcxt = NULL;
+ newrel->rd_partdesc = NULL;
+ }
#undef SWAPFIELD
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index d3cfb55f9f..17766b1c49 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -135,8 +135,8 @@ typedef struct PartitionPruneState
PartitionPruningData *partprunedata[FLEXIBLE_ARRAY_MEMBER];
} PartitionPruneState;
-extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
- Relation rel);
+extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(EState *estate,
+ ModifyTableState *mtstate, Relation rel);
extern ResultRelInfo *ExecFindPartition(ModifyTableState *mtstate,
ResultRelInfo *rootResultRelInfo,
PartitionTupleRouting *proute,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 5ed0f40f69..985c752d01 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -21,6 +21,7 @@
#include "lib/pairingheap.h"
#include "nodes/params.h"
#include "nodes/plannodes.h"
+#include "partitioning/partdefs.h"
#include "utils/hsearch.h"
#include "utils/queryenvironment.h"
#include "utils/reltrigger.h"
@@ -523,6 +524,9 @@ typedef struct EState
*/
List *es_tuple_routing_result_relations;
+ /* Directory of partitions used for any purpose. */
+ PartitionDirectory es_partition_directory;
+
/* Stuff used for firing triggers: */
List *es_trig_target_relations; /* trigger-only ResultRelInfos */
TupleTableSlot *es_trig_tuple_slot; /* for trigger output tuples */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 6fd24203dd..bcf2054838 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -19,6 +19,7 @@
#include "lib/stringinfo.h"
#include "nodes/params.h"
#include "nodes/parsenodes.h"
+#include "partitioning/partdefs.h"
#include "storage/block.h"
@@ -343,6 +344,9 @@ typedef struct PlannerInfo
/* Does this query modify any partition key columns? */
bool partColsUpdated;
+
+ /* Partition directory. */
+ PartitionDirectory partition_directory;
} PlannerInfo;
diff --git a/src/include/partitioning/partdefs.h b/src/include/partitioning/partdefs.h
index 1fe1b4868e..9d94740d1d 100644
--- a/src/include/partitioning/partdefs.h
+++ b/src/include/partitioning/partdefs.h
@@ -21,4 +21,6 @@ typedef struct PartitionBoundSpec PartitionBoundSpec;
typedef struct PartitionDescData *PartitionDesc;
+typedef struct PartitionDirectoryData *PartitionDirectory;
+
#endif /* PARTDEFS_H */
diff --git a/src/include/partitioning/partdir.h b/src/include/partitioning/partdir.h
new file mode 100644
index 0000000000..0472575bc1
--- /dev/null
+++ b/src/include/partitioning/partdir.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * partdir.h
+ * A partition directory provides stable PartitionDesc lookups
+ *
+ * Copyright (c) 2007-2018, PostgreSQL Global Development Group
+ *
+ * src/include/partitioning/partdir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PARTDIR_H
+#define PARTDIR_H
+
+#include "partitioning/partdefs.h"
+#include "utils/relcache.h"
+
+extern PartitionDirectory CreatePartitionDirectory(MemoryContext mcxt);
+extern PartitionDesc PartitionDirectoryLookup(PartitionDirectory, Relation);
+
+#endif /* PARTDIR_H */
--
2.14.3 (Apple Git-98)
Attachment: 0003-Lower-the-lock-level-for-ALTER-TABLE-.-ATTACH-DETACH.patch (application/octet-stream)
From 3b28c098a89100e18c55832378d39e4ecd5a03ba Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 20 Dec 2018 12:59:21 -0500
Subject: [PATCH 3/3] Lower the lock level for ALTER TABLE .. ATTACH/DETACH
PARTITION.
---
src/backend/commands/tablecmds.c | 24 +++++++++++++++++++++++-
1 file changed, 23 insertions(+), 1 deletion(-)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index ce0c7b3153..42862d9ef5 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -3627,7 +3627,29 @@ AlterTableGetLockLevel(List *cmds)
case AT_AttachPartition:
case AT_DetachPartition:
- cmd_lockmode = AccessExclusiveLock;
+ /*
+ * We can attach or detach a partition with only
+ * ShareUpdateExclusiveLock on the partitioned table, but at
+ * least in the case of an ATTACH PARTITION operation, we need
+ * a stronger lock on the partition itself and on any default
+ * partition of the partitioned table. If we didn't do this,
+ * we could be in the middle of routing a tuple to a table and
+ * at the same time its partition constraint could be changing
+ * under us, which would possibly result in inserting a tuple
+ * that does not satisfy the partition constraint. Or, we
+ * could decide to prune the table from the query while the
+ * partition constraint is changing in such a way that the
+ * table should no longer be pruned.
+ *
+ * Note that attaching or detaching a partition becomes visible
+ * to other sessions as soon as the transaction which performed
+ * the operation commits. We can't use the query snapshot,
+ * which might be older, to determine which partitions are
+ * visible to a particular query, because the tables that were
+ * visible at that time might no longer exist, might no longer
+ * have a matching tuple descriptor, etc.
+ */
+ cmd_lockmode = ShareUpdateExclusiveLock;
break;
default: /* oops */
--
2.14.3 (Apple Git-98)
Thanks for this work! I like the name "partition directory".
On 2018-Dec-20, Robert Haas wrote:
0002 introduces the concept of a partition directory. The idea is
that the planner will create a partition directory, and so will the
executor, and all calls which occur in those places to
RelationGetPartitionDesc() will instead call
PartitionDirectoryLookup(). Every lookup for the same relation in the
same partition directory is guaranteed to produce the same answer. I
believe this patch still has a number of weaknesses. More on that
below.
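To make the stable-lookup guarantee concrete, here is a standalone toy model of the idea (hypothetical C, not the actual patch code: the real version hashes OIDs into PartitionDirectoryEntry structs and caches a PartitionDesc rather than an int):

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned int Oid;

/* One cached answer per relation OID. */
typedef struct DirEntry
{
    Oid  relid;
    int  ndescparts;            /* stand-in for a full PartitionDesc */
    struct DirEntry *next;
} DirEntry;

typedef struct Directory
{
    DirEntry *head;
} Directory;

/* The "catalog": the current partition count, mutable by concurrent DDL. */
static int current_nparts = 2;

static Directory *
create_directory(void)
{
    Directory *dir = malloc(sizeof(Directory));

    dir->head = NULL;
    return dir;
}

/*
 * First lookup for a given OID snapshots the current state; every later
 * lookup through the same directory returns that cached copy, even if
 * the catalog has changed in the meantime.
 */
static int
directory_lookup(Directory *dir, Oid relid)
{
    DirEntry *e;

    for (e = dir->head; e != NULL; e = e->next)
        if (e->relid == relid)
            return e->ndescparts;       /* cached: same answer as before */

    e = malloc(sizeof(DirEntry));
    e->relid = relid;
    e->ndescparts = current_nparts;     /* snapshot the current value */
    e->next = dir->head;
    dir->head = e;
    return e->ndescparts;
}
```

A fresh directory created after a concurrent change sees the new state, but an existing directory keeps returning its first answer, which is the property the planner and executor each rely on within a single cycle.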
The commit message for this one also points out another potential
problem:
Introduce the concept of a partition directory.
Teach the optimizer and executor to use it, so that a single planning
cycle or query execution gets the same PartitionDesc for the same table
every time it looks it up. This does not prevent changes between
planning and execution, nor does it guarantee that all tables are
expanded according to the same snapshot.
Namely: how does this handle the case of partition pruning structure
being passed from planner to executor, if an attach happens in the
middle of it and puts a partition in between existing partitions? Array
indexes of any partitions that appear later in the partition descriptor
will change.
This is the reason I used the query snapshot rather than EState.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Dec 20, 2018 at 3:58 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Introduce the concept of a partition directory.
Teach the optimizer and executor to use it, so that a single planning
cycle or query execution gets the same PartitionDesc for the same table
every time it looks it up. This does not prevent changes between
planning and execution, nor does it guarantee that all tables are
expanded according to the same snapshot.
Namely: how does this handle the case of partition pruning structure
being passed from planner to executor, if an attach happens in the
middle of it and puts a partition in between existing partitions? Array
indexes of any partitions that appear later in the partition descriptor
will change.
This is the reason I used the query snapshot rather than EState.
I didn't handle that. If partition pruning relies on nothing changing
between planning and execution, isn't that broken regardless of any of
this? It's true that with the simple query protocol we'll hold locks
continuously from planning into execution, and therefore with the
current locking regime we couldn't really have a problem. But unless
I'm confused, with the extended query protocol it's quite possible to
generate a plan, release locks, and then reacquire locks at execution
time. Unless we have some guarantee that a new plan will always be
generated if any DDL has happened in the middle, I think we've got
trouble, and I don't think that is guaranteed in all cases.
Maybe I'm all wet, though.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2018-Dec-20, Robert Haas wrote:
I didn't handle that. If partition pruning relies on nothing changing
between planning and execution, isn't that broken regardless of any of
this? It's true that with the simple query protocol we'll hold locks
continuously from planning into execution, and therefore with the
current locking regime we couldn't really have a problem. But unless
I'm confused, with the extended query protocol it's quite possible to
generate a plan, release locks, and then reacquire locks at execution
time. Unless we have some guarantee that a new plan will always be
generated if any DDL has happened in the middle, I think we've got
trouble, and I don't think that is guaranteed in all cases.
Oh, so maybe this case is already handled by plan invalidation -- I
mean, if we run DDL, the stored plan is thrown away and a new one
recomputed. IOW this was already a solved problem and I didn't need to
spend effort on it. /me slaps own forehead
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Dec 20, 2018 at 4:11 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Oh, so maybe this case is already handled by plan invalidation -- I
mean, if we run DDL, the stored plan is thrown away and a new one
recomputed. IOW this was already a solved problem and I didn't need to
spend effort on it. /me slaps own forehead
I'm kinda saying the opposite - I'm not sure that it's safe even with
the higher lock levels. If the plan is relying on the same partition
descriptor being in effect at plan time as at execution time, that
sounds kinda dangerous to me.
Lowering the lock level might also make something that was previously
safe into something unsafe, because now there's no longer a guarantee
that invalidation messages are received soon enough. With
AccessExclusiveLock, we'll send invalidation messages before releasing
the lock, and other processes will acquire the lock and then
AcceptInvalidationMessages(). But with ShareUpdateExclusiveLock the
locks can coexist, so now there might be trouble. I think this is an
area where we need to do some more investigation.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Dec 20, 2018 at 4:38 PM Robert Haas <robertmhaas@gmail.com> wrote:
Lowering the lock level might also make something that was previously
safe into something unsafe, because now there's no longer a guarantee
that invalidation messages are received soon enough. With
AccessExclusiveLock, we'll send invalidation messages before releasing
the lock, and other processes will acquire the lock and then
AcceptInvalidationMessages(). But with ShareUpdateExclusiveLock the
locks can coexist, so now there might be trouble. I think this is an
area where we need to do some more investigation.
So there are definitely problems here. With my patch:
create table tab (a int, b text) partition by range (a);
create table tab1 partition of tab for values from (0) to (10);
prepare t as select * from tab;
begin;
explain execute t; -- seq scan on tab1
execute t; -- no rows
Then, in another session:
alter table tab detach partition tab1;
insert into tab1 values (300, 'oops');
Back to the first session:
execute t; -- shows (300, 'oops')
explain execute t; -- still planning to scan tab1
commit;
explain execute t; -- now it got the memo, and plans to scan nothing
execute t; -- no rows
Well, that's not good. We're showing a value that was never within
the partition bounds of any partition of tab. The problem is that,
since we already have locks on all relevant objects, nothing triggers
the second 'explain execute' to process invalidation messages, so we
don't update the plan. Generally, any DDL with less than
AccessExclusiveLock has this issue. On another thread, I was
discussing with Tom and Peter the possibility of trying to rejigger
things so that we always AcceptInvalidationMessages() at least once
per command, but I think that just turns this into a race: if a
concurrent commit happens after 'explain execute t' decides not to
re-plan but before it begins executing, we have the same problem.
This example doesn't involve partition pruning, and in general I don't
think that the problem is confined to partition pruning. It's rather
that if there's no conflict between the lock that is needed to change
the set of partitions and the lock that is needed to run a query, then
there's no way to guarantee that the query runs with the same set of
partitions for which it was planned. Unless I'm mistaken, which I
might be, this is also a problem with your approach -- if you repeat
the same prepared query in the same transaction, the transaction
snapshot will be updated, and thus the PartitionDesc will be expanded
differently at execution time, but the plan will not have changed,
because invalidation messages have not been processed.
Anyway, I think the only fix here is likely to be making the executor
resilient against concurrent changes in the PartitionDesc. I don't
think there's going to be any easy way to compensate for added
partitions without re-planning, but maybe we could find a way to flag
detached partitions so that they return no rows without actually
touching the underlying relation.
Thoughts?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, 21 Dec 2018 at 09:43, Robert Haas <robertmhaas@gmail.com> wrote:
- I refactored expand_inherited_rtentry() to drive partition expansion
entirely off of PartitionDescs. The reason why this is necessary is
that it clearly will not work to have find_all_inheritors() use a
current snapshot to decide what children we have and lock them, and
then consult a different source of truth to decide which relations to
open with NoLock. There's nothing to keep the lists of partitions
from being different in the two cases, and that demonstrably causes
assertion failures if you SELECT with an ATTACH/DETACH loop running in
the background. However, it also changes the order in which tables get
locked. Possibly that could be fixed by teaching
expand_partitioned_rtentry() to qsort() the OIDs the way
find_inheritance_children() does. It also loses the infinite-loop
protection which find_all_inheritors() has. Not sure what to do about
that.
I don't think you need to qsort() the Oids before locking. What the
qsort() does today is ensure we get a consistent locking order. Any
other order would surely do, providing we stick to it consistently. I
think PartitionDesc order is fine, as it's consistent. Having it
locked in PartitionDesc order I think is what's needed for [1] anyway.
[2] proposes to relax the locking order taken during execution.
[1]: https://commitfest.postgresql.org/21/1778/
[2]: https://commitfest.postgresql.org/21/1887/
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, 21 Dec 2018 at 10:05, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Dec 20, 2018 at 3:58 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Namely: how does this handle the case of partition pruning structure
being passed from planner to executor, if an attach happens in the
middle of it and puts a partition in between existing partitions? Array
indexes of any partitions that appear later in the partition descriptor
will change.
This is the reason I used the query snapshot rather than EState.
I didn't handle that. If partition pruning relies on nothing changing
between planning and execution, isn't that broken regardless of any of
this? It's true that with the simple query protocol we'll hold locks
continuously from planning into execution, and therefore with the
current locking regime we couldn't really have a problem. But unless
I'm confused, with the extended query protocol it's quite possible to
generate a plan, release locks, and then reacquire locks at execution
time. Unless we have some guarantee that a new plan will always be
generated if any DDL has happened in the middle, I think we've got
trouble, and I don't think that is guaranteed in all cases.
Today the plan would be invalidated if a partition was ATTACHED or
DETACHED. The newly built plan would get the updated list of
partitions.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Dec 21, 2018 at 6:04 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
I don't think you need to qsort() the Oids before locking. What the
qsort() does today is ensure we get a consistent locking order. Any
other order would surely do, providing we stick to it consistently. I
think PartitionDesc order is fine, as it's consistent. Having it
locked in PartitionDesc order I think is what's needed for [1] anyway.
[2] proposes to relax the locking order taken during execution.
If queries take locks in one order and DDL takes them in some other
order, queries and DDL starting around the same time could deadlock.
Unless we convert the whole system to lock everything in PartitionDesc
order the issue doesn't go away completely. But maybe we just have to
live with that. Surely we're not going to pay the cost of locking
partitions that we don't otherwise need to avoid a deadlock-vs-DDL
risk, and once we've decided to assume that risk, I'm not sure a
qsort() here helps anything much.
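For reference, the deterministic ordering that find_inheritance_children() provides is just an ascending-OID sort before the children are locked; a minimal standalone sketch of that convention (not the actual backend code):

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned int Oid;

/* Comparator giving the ascending-OID order used before locking children. */
static int
oid_cmp(const void *a, const void *b)
{
    Oid oa = *(const Oid *) a;
    Oid ob = *(const Oid *) b;

    if (oa < ob)
        return -1;
    if (oa > ob)
        return 1;
    return 0;
}

/* Sort the children so that every backend locks them in the same order. */
static void
sort_children_for_locking(Oid *oids, int n)
{
    qsort(oids, n, sizeof(Oid), oid_cmp);
}
```

The point of the sort is only consistency: any two backends that lock the same set of children in the same total order cannot deadlock against each other, whatever that order is.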
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Dec 21, 2018 at 6:06 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
I didn't handle that. If partition pruning relies on nothing changing
between planning and execution, isn't that broken regardless of any of
this? It's true that with the simple query protocol we'll hold locks
continuously from planning into execution, and therefore with the
current locking regime we couldn't really have a problem. But unless
I'm confused, with the extended query protocol it's quite possible to
generate a plan, release locks, and then reacquire locks at execution
time. Unless we have some guarantee that a new plan will always be
generated if any DDL has happened in the middle, I think we've got
trouble, and I don't think that is guaranteed in all cases.
Today the plan would be invalidated if a partition was ATTACHED or
DETACHED. The newly built plan would get the updated list of
partitions.
I think you're right, and that won't be true any more once we lower
the lock level, so it has to be handled somehow. The entire plan
invalidation mechanism seems to depend fundamentally on
AccessExclusiveLock being used everywhere, so this is likely to be an
ongoing issue every time we want to reduce a lock level anywhere. I
wonder if there is any kind of systematic fix or if we are just going
to have to keep inventing ad-hoc solutions.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, 25 Dec 2018 at 08:15, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Dec 21, 2018 at 6:04 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
I don't think you need to qsort() the Oids before locking. What the
qsort() does today is ensure we get a consistent locking order. Any
other order would surely do, providing we stick to it consistently. I
think PartitionDesc order is fine, as it's consistent. Having it
locked in PartitionDesc order I think is what's needed for [1] anyway.
[2] proposes to relax the locking order taken during execution.
If queries take locks in one order and DDL takes them in some other
order, queries and DDL starting around the same time could deadlock.
Unless we convert the whole system to lock everything in PartitionDesc
order the issue doesn't go away completely. But maybe we just have to
live with that. Surely we're not going to pay the cost of locking
partitions that we don't otherwise need to avoid a deadlock-vs-DDL
risk, and once we've decided to assume that risk, I'm not sure a
qsort() here helps anything much.
When I said "consistent" I meant consistent over all places where we
obtain locks on all partitions. My original v1-0002 patch attempted
something like this.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Dec 20, 2018 at 3:58 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Namely: how does this handle the case of partition pruning structure
being passed from planner to executor, if an attach happens in the
middle of it and puts a partition in between existing partitions? Array
indexes of any partitions that appear later in the partition descriptor
will change.
I finally got a little more time to work on this. It took me a
while to understand that a PartitionedRelPruneInfos assumes that the
indexes of partitions in the PartitionDesc don't change between
planning and execution, because subplan_map[] and subpart_map[] are
indexed by PartitionDesc offset. I suppose the reason for this is so
that we don't have to go to the expense of copying the partition
bounds from the PartitionDesc into the final plan, but it somehow
seems a little scary to me. Perhaps I am too easily frightened, but
it's certainly a problem from the point of view of this project, which
wants to let the PartitionDesc change concurrently.
I wrote a little patch that stores the relation OIDs of the partitions
into the PartitionedPruneRelInfo and then, at execution time, does an
Assert() that what it gets matches what existed at plan time. I
figured that a good start would be to find a test case where this
fails with concurrent DDL allowed, but I haven't so far succeeded in
devising one. To make the Assert() fail, I need to come up with a
case where concurrent DDL has caused the PartitionDesc to be rebuilt
but without causing an update to the plan. If I use prepared queries
inside of a transaction block, I can continue to run old plans after
concurrent DDL has changed things, but I can't actually make the
Assert() fail, because the queries continue to use the old plans
precisely because they haven't processed invalidation messages, and
therefore they also have the old PartitionDesc and everything works.
Maybe if I try it with CLOBBER_CACHE_ALWAYS...
I also had the idea of trying to use a cursor, because if I could
start execution of a query, then force a relcache rebuild, then
continue executing the query, maybe something would blow up somehow.
But that's not so easy because I don't think we have any way using SQL
to declare a cursor for a prepared query, so I'd need to get a
query plan that involves run-time pruning without using parameters,
which I'm pretty sure is possible but I haven't figured out yet.
And even there the PartitionDirectory concept might preserve us from
any damage if the change happens after the executor is initialized,
though I'm not sure if there are any cases where we don't do the first
PartitionDesc lookup for a particular table until mid-execution.
Anyway, I think this idea of passing a list of relation OIDs that we
saw at planning time through to the executor and cross-checking them
might have some legs. If we only allowed concurrently *adding*
partitions and not concurrently *removing* them, then even if we find
the case(s) where the PartitionDesc can change under us, we can
probably just adjust subplan_map and subpart_map to compensate, since
we can iterate through the old and new arrays of relation OIDs and
just figure out which things have shifted to higher indexes in the
PartitionDesc. This is all kind of hand-waving at the moment; tips
appreciated.
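The adjustment described above, walking the old and new OID arrays to find where each plan-time index landed, could be sketched as follows (standalone C under the attach-only assumption; not actual PostgreSQL code, and the function name is hypothetical):

```c
#include <assert.h>

typedef unsigned int Oid;

/*
 * Given the plan-time array of partition OIDs and the execution-time
 * array (same relative order, with concurrently attached partitions
 * possibly inserted anywhere), fill old_to_new[i] with the
 * execution-time index of plan-time partition i.  Returns 1 on
 * success; returns 0 if some plan-time partition is gone, in which
 * case the maps cannot be adjusted and we would have to replan.
 */
static int
remap_partition_indexes(const Oid *old_oids, int nold,
                        const Oid *new_oids, int nnew,
                        int *old_to_new)
{
    int         i;
    int         j = 0;

    for (i = 0; i < nold; i++)
    {
        /* Skip over entries for concurrently added partitions. */
        while (j < nnew && new_oids[j] != old_oids[i])
            j++;
        if (j >= nnew)
            return 0;           /* old partition vanished: must replan */
        old_to_new[i] = j++;
    }
    return 1;
}
```

With a map like this in hand, subplan_map and subpart_map entries could be shifted from plan-time offsets to execution-time offsets in a single pass; the detach case is exactly where the walk fails, matching the observation that removals are the harder half of the problem.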
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2019-Jan-25, Robert Haas wrote:
I finally got a little more time to work on this. It took me a
while to understand that a PartitionedRelPruneInfos assumes that the
indexes of partitions in the PartitionDesc don't change between
planning and execution, because subplan_map[] and subpart_map[] are
indexed by PartitionDesc offset.
Right, the planner/executor "disconnect" is one of the challenges, and
why I was trying to keep the old copy of the PartitionDesc around
instead of building updated ones as needed.
I suppose the reason for this is so
that we don't have to go to the expense of copying the partition
bounds from the PartitionDesc into the final plan, but it somehow
seems a little scary to me. Perhaps I am too easily frightened, but
it's certainly a problem from the point of view of this project, which
wants to let the PartitionDesc change concurrently.
Well, my definition of the problem started with the assumption that we
would keep the partition array indexes unchanged, so "change
concurrently" is what we needed to avoid. Yes, I realize that you've
opted to change that definition.
I may have forgotten some of your earlier emails on this, but one aspect
(possibly a key one) is that I'm not sure we really need to cope, other
than with an ERROR, with queries that continue to run across an
attach/detach -- moreso in absurd scenarios such as the ones you
described where the detached table is later re-attached, possibly to a
different partitioned table. I mean, if we can just detect the case and
raise an error, and this let us make it all work reasonably, that might
be better.
I wrote a little patch that stores the relation OIDs of the partitions
into the PartitionedPruneRelInfo and then, at execution time, does an
Assert() that what it gets matches what existed at plan time. I
figured that a good start would be to find a test case where this
fails with concurrent DDL allowed, but I haven't so far succeeded in
devising one. To make the Assert() fail, I need to come up with a
case where concurrent DDL has caused the PartitionDesc to be rebuilt
but without causing an update to the plan. If I use prepared queries
inside of a transaction block, [...]
I also had the idea of trying to use a cursor, because if I could
start execution of a query, [...]
Those are the ways I thought of, and the reason for the shape of some of
those .spec tests. I wasn't able to hit the situation.
Maybe if I try it with CLOBBER_CACHE_ALWAYS...
I didn't try this one.
Anyway, I think this idea of passing a list of relation OIDs that we
saw at planning time through to the executor and cross-checking them
might have some legs. If we only allowed concurrently *adding*
partitions and not concurrently *removing* them, then even if we find
the case(s) where the PartitionDesc can change under us, we can
probably just adjust subplan_map and subpart_map to compensate, since
we can iterate through the old and new arrays of relation OIDs and
just figure out which things have shifted to higher indexes in the
PartitionDesc. This is all kind of hand-waving at the moment; tips
appreciated.
I think detaching partitions concurrently is a necessary part of this
feature, so I would prefer not to go with a solution that works for
attaching partitions but not for detaching them. That said, I don't see
why it's impossible to adjust the partition maps in both cases. But I
don't have anything better than hand-waving ATM.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jan 25, 2019 at 4:18 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Right, the planner/executor "disconnect" is one of the challenges, and
why I was trying to keep the old copy of the PartitionDesc around
instead of building updated ones as needed.
I agree that would be simpler, but I just don't see how to make it
work. For one thing, keeping the old copy around can't work in
parallel workers, which never had a copy in the first place. For two
things, we don't have a really good mechanism to keep the
PartitionDesc that was used at plan time around until execution time.
Keeping the relation open would do it, but I'm pretty sure that causes
other problems; the system doesn't expect any residual references.
I know you had a solution to this problem, but I don't see how it can
work. You said "Snapshots have their own cache (hash table) of
partition descriptors. If a partdesc is requested and the snapshot has
already obtained that partdesc, the original one is returned -- we
don't request a new one from partcache." But presumably this means
when the last snapshot is unregistered, the cache is flushed
(otherwise, when?) and if that's so then this relies on the snapshot
that was used for planning still being around at execution time, which
I am pretty sure isn't guaranteed.
Also, and I think this point is worthy of some further discussion, the
thing that really seems broken to me about your design is the idea
that it's OK to use the query or transaction snapshot to decide which
partitions exist. The problem with that is that some query or
transaction with an old snapshot might see as a partition some table
that has been dropped or radically altered - different column types,
attached to some other table now, attached to same table but with
different bounds, or just dropped. And therefore it might try to
insert data into that table and fail in all kinds of crazy ways, about
the mildest of which is inserting data that doesn't match the current
partition constraint. I'm willing to be told that I've misunderstood
the way it all works and this isn't really a problem for some reason,
but my current belief is that not only is it a problem with your
design, but that it's such a bad problem that there's really no way to
fix it and we have to abandon your entire approach and go a different
route. If you don't think that's true, then perhaps we should discuss
it further.
I suppose the reason for this is so
that we don't have to go to the expense of copying the partition
bounds from the PartitionDesc into the final plan, but it somehow
seems a little scary to me. Perhaps I am too easily frightened, but
it's certainly a problem from the point of view of this project, which
wants to let the PartitionDesc change concurrently.

Well, my definition of the problem started with the assumption that we
would keep the partition array indexes unchanged, so "change
concurrently" is what we needed to avoid. Yes, I realize that you've
opted to change that definition.
I don't think I made a conscious decision to change this, and I'm kind
of wondering whether I have missed some better approach here. I feel
like the direction I'm pursuing is an inevitable consequence of having
no good way to keep the PartitionDesc around from plan-time to
execution-time, which in turn feels like an inevitable consequence of
the two points I made above: there's no guarantee that the plan-time
snapshot is still registered anywhere by the time we get to execution
time, and even if there were, the associated PartitionDesc may point
to tables that have been drastically modified or don't exist any more.
But it's possible that my chain-of-inevitable-consequences has a weak
link, in which case I would surely like it if you (or someone else)
would point it out to me.
I may have forgotten some of your earlier emails on this, but one aspect
(possibly a key one) is that I'm not sure we really need to cope, other
than with an ERROR, with queries that continue to run across an
attach/detach -- more so in absurd scenarios such as the ones you
described where the detached table is later re-attached, possibly to a
different partitioned table. I mean, if we can just detect the case and
raise an error, and this lets us make it all work reasonably, that might
be better.
Well, that's an interesting idea. I assumed that users would hate
that kind of behavior with a fiery passion that could never be
quenched. If not, then the problem changes from *coping* with
concurrent changes to *detecting* concurrent changes, which may be
easier, but see below.
I think detaching partitions concurrently is a necessary part of this
feature, so I would prefer not to go with a solution that works for
attaching partitions but not for detaching them. That said, I don't see
why it's impossible to adjust the partition maps in both cases. But I
don't have anything better than hand-waving ATM.
The general problem here goes back to what I wrote in the third
paragraph of this email: a PartitionDesc that was built with a
particular snapshot can't be assumed to be usable after any subsequent
DDL has occurred that might affect the shape of the PartitionDesc.
For example, if somebody detaches a partition and reattaches it with
different partition bounds, and we use the old PartitionDesc for tuple
routing, we'll route tuples into that partition that do not satisfy
its partition constraint. And we won't even get an ERROR, because the
system assumes that any tuple which arrives at a partition as a result
of tuple routing must necessarily satisfy the partition constraint.
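The hazard can be seen with a toy range router (this is not the real execPartition.c code; all names here are invented for illustration): routing trusts whatever bounds array it was handed and performs no recheck, so a stale copy of the bounds silently misroutes.

```c
#include <assert.h>

/*
 * Toy single-column range router: partition i accepts values in
 * [lo[i], hi[i]).  As in the real system, once a tuple is routed the
 * partition constraint is NOT rechecked at the destination.
 */
static int
route_tuple(const int lo[], const int hi[], int nparts, int value)
{
	for (int i = 0; i < nparts; i++)
		if (value >= lo[i] && value < hi[i])
			return i;
	return -1;					/* no partition accepts this value */
}
```

With a stale view where the second partition covers [1000, 2000), the value 1500 is routed there without error, even if the partition was concurrently reattached as [1000, 1200) and the tuple violates its real constraint.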
If only concurrent CREATE/ATTACH operations are allowed and
DROP/DETACH is not, then that kind of thing isn't possible. Any new
partitions which have shown up since the plan was created can just be
ignored, and the old ones must still have the same partition bounds
that they did before, and everything is fine. Or, if we're OK with a
less-nice solution, we could just ERROR out when the number of
partitions has changed. Some people will get errors they don't like,
but they won't end up with rows in their partitions that violate the
constraints.
But as soon as you allow concurrent DETACH, then things get really
crazy. Even if, at execution time, there are the same number of
partitions as I had at plan time, and even if those partitions have
the same OIDs as what I had at plan time, and even if those OIDs are
in the same order in the PartitionDesc, it does not prove that things
are OK. The partition could have been detached and reattached with a
narrower set of partition bounds. And if so, then we might route a
tuple to it that doesn't fit that narrower set of bounds, and there
will be no error, just database corruption.
I suppose one idea for handling this is to stick a counter into
pg_class or something that gets incremented every time a partition is
detached. At plan time, save the counter value; if it has changed at
execution time, ERROR. If not, then you have only added partitions to
worry about, and that's an easier problem.
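As a sketch of that scheme (everything here is hypothetical; pg_class has no such counter today): the planner saves a per-table detach counter into the plan, and the executor errors out if it has moved, since an unchanged counter means only additions remain to worry about.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical counter, bumped once per concurrent DETACH PARTITION. */
static uint64_t detach_counter = 0;

typedef struct PlanStub
{
	uint64_t	detach_counter_at_plan;	/* saved at plan time */
} PlanStub;

static PlanStub
plan_query(void)
{
	PlanStub	plan = { .detach_counter_at_plan = detach_counter };

	return plan;
}

static void
concurrent_detach(void)
{
	detach_counter++;			/* a partition went away (or moved) */
}

/*
 * Execution-time check: false means the executor should ERROR out.
 * If the counter is unchanged, any partitions added since plan time
 * can simply be ignored.
 */
static bool
plan_still_valid(const PlanStub *plan)
{
	return plan->detach_counter_at_plan == detach_counter;
}
```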
But I kind of wonder whether we're really gaining as much as you think
by trying to support concurrent DETACH in the first place. If there
are queries running against the table, there's probably at least
AccessShareLock on the partition itself, not just the parent. And
that means that reducing the necessary lock on the parent from
AccessExclusiveLock to something less doesn't really help that much,
because we're still going to block trying to acquire
AccessExclusiveLock on the child. Now you might say: OK, well, just
reduce that lock level, too.
But that seems to me to be opening a giant can of worms which we are
unlikely to get resolved in time for this release. The worst of those
problems is that if you also reduce the lock level on the partition
when attaching it, then you are adding a constraint while somebody
might at the exact same time be inserting a tuple that violates that
constraint. Those two things have to be synchronized somehow. You
could avoid that by reducing the lock level on the partition when
detaching and not when attaching. But even then, detaching a
partition can involve performing a whole bunch of operations for which
we currently require AccessExclusiveLock. AlterTableGetLockLevel says:
    /*
     * Removing constraints can affect SELECTs that have been
     * optimised assuming the constraint holds true.
     */
    case AT_DropConstraint:     /* as DROP INDEX */
    case AT_DropNotNull:        /* may change some SQL plans */
        cmd_lockmode = AccessExclusiveLock;
        break;
Dropping a partition implicitly involves dropping a constraint. We
could gamble that the above has no consequences that are really
serious enough to care about, and that none of the other subsidiary
objects that we adjust during detach (indexes, foreign keys, triggers)
really need AccessExclusiveLock now, and that none of the other kinds
of subsidiary objects that we might need to adjust in the future
during a detach will be changes that require AccessExclusiveLock
either, but that sounds awfully risky to me. We have very little DDL
that runs with less than AccessExclusiveLock, and I've already found
lots of subtle problems that have to be patched up just for the
particular case of allowing attach/detach to take a lesser lock on the
parent table, and I bet that there are a whole bunch more similar
problems when you start talking about weakening the lock on the child
table, and I'm not convinced that there are any reasonable solutions
to some of those problems, let alone that we can come up with good
solutions to all of them in the very near future.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, 28 Jan 2019 at 20:15, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jan 25, 2019 at 4:18 PM Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

[...]

I suppose one idea for handling this is to stick a counter into
pg_class or something that gets incremented every time a partition is
detached. At plan time, save the counter value; if it has changed at
execution time, ERROR. If not, then you have only added partitions to
worry about, and that's an easier problem.
Yes, a version number would solve that issue.
But I kind of wonder whether we're really gaining as much as you think
by trying to support concurrent DETACH in the first place. If there
are queries running against the table, there's probably at least
AccessShareLock on the partition itself, not just the parent. And
that means that reducing the necessary lock on the parent from
AccessExclusiveLock to something less doesn't really help that much,
because we're still going to block trying to acquire
AccessExclusiveLock on the child. Now you might say: OK, well, just
reduce that lock level, too.
The whole point of CONCURRENT detach is that you're not removing it whilst
people are still using it, you're just marking it for later disuse.
But that seems to me to be opening a giant can of worms which we are
unlikely to get resolved in time for this release. The worst of those
problems is that if you also reduce the lock level on the partition
when attaching it, then you are adding a constraint while somebody
might at the exact same time be inserting a tuple that violates that
constraint.
Spurious.
This would only be true if we were adding a constraint that affected
existing partitions.
The constraint being added affects the newly added partition, not existing
ones.
Those two things have to be synchronized somehow. You
could avoid that by reducing the lock level on the partition when
detaching and not when attaching. But even then, detaching a
partition can involve performing a whole bunch of operations for which
we currently require AccessExclusiveLock. AlterTableGetLockLevel says:

    /*
     * Removing constraints can affect SELECTs that have been
     * optimised assuming the constraint holds true.
     */
    case AT_DropConstraint:     /* as DROP INDEX */
    case AT_DropNotNull:        /* may change some SQL plans */
        cmd_lockmode = AccessExclusiveLock;
        break;

Dropping a partition implicitly involves dropping a constraint. We
could gamble that the above has no consequences that are really
It's not a gamble if you know that the constraints being dropped constrain
only the object being dropped.
serious enough to care about, and that none of the other subsidiary
objects that we adjust during detach (indexes, foreign keys, triggers)
really need AccessExclusiveLock now, and that none of the other kinds
of subsidiary objects that we might need to adjust in the future
during a detach will be changes that require AccessExclusiveLock
either, but that sounds awfully risky to me. We have very little DDL
that runs with less than AccessExclusiveLock, and I've already found
lots of subtle problems that have to be patched up just for the
particular case of allowing attach/detach to take a lesser lock on the
parent table, and I bet that there are a whole bunch more similar
problems when you start talking about weakening the lock on the child
table, and I'm not convinced that there are any reasonable solutions
to some of those problems, let alone that we can come up with good
solutions to all of them in the very near future.
I've not read every argument on this thread, but many of the later points
made here are spurious, by which I mean they sound like they could apply
but in fact do not.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jan 29, 2019 at 12:29 AM Simon Riggs <simon@2ndquadrant.com> wrote:
[...]

The whole point of CONCURRENT detach is that you're not removing it
whilst people are still using it, you're just marking it for later
disuse.
Well, I don't think that's the way any patch so far proposed actually works.
[...] you are adding a constraint while somebody
might at the exact same time be inserting a tuple that violates that
constraint.

Spurious.

This would only be true if we were adding a constraint that affected
existing partitions. The constraint being added affects the newly
added partition, not existing ones.
I agree that it affects the newly added partition, not existing ones.
But if you don't hold an AccessExclusiveLock on that partition while
you are adding that constraint to it, then somebody could be
concurrently inserting a tuple that violates that constraint. This
would be an INSERT targeting the partition directly, not somebody
operating on the partitioning hierarchy to which it is being attached.
[...] Dropping a partition implicitly involves dropping a constraint. We
could gamble that the above has no consequences that are really
serious enough to care about.

It's not a gamble if you know that the constraints being dropped
constrain only the object being dropped.
That's not true, but I can't refute your argument any more than that
because you haven't made one.
I've not read every argument on this thread, but many of the later points made here are spurious, by which I mean they sound like they could apply but in fact do not.
I think they do apply, and until somebody explains convincingly why
they don't, I'm going to keep thinking that they do. Telling me that
my points are wrong without making any kind of argument about why they
are wrong is not constructive. I've put a lot of energy into
analyzing this topic, both recently and in previous release cycles,
and I'm not inclined to just say "OK, well, Simon says I'm wrong, so
that's the end of it."
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jan 25, 2019 at 4:18 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
I wrote a little patch that stores the relation OIDs of the partitions
into the PartitionedPruneRelInfo and then, at execution time, does an
Assert() that what it gets matches what existed at plan time. I
figured that a good start would be to find a test case where this
fails with concurrent DDL allowed, but I haven't so far succeeded in
devising one. To make the Assert() fail, I need to come up with a
case where concurrent DDL has caused the PartitionDesc to be rebuilt
but without causing an update to the plan. If I use prepared queries
inside of a transaction block, [...]

I also had the idea of trying to use a cursor, because if I could
start execution of a query, [...]

Those are the ways I thought of, and the reason for the shape of some of
those .spec tests. I wasn't able to hit the situation.
I've managed to come up with a test case that seems to hit this case.
Preparation:
create table foo (a int, b text, primary key (a)) partition by range (a);
create table foo1 partition of foo for values from (0) to (1000);
create table foo2 partition of foo for values from (1000) to (2000);
insert into foo1 values (1, 'one');
insert into foo2 values (1001, 'two');
alter system set plan_cache_mode = force_generic_plan;
select pg_reload_conf();
$ cat >x
alter table foo detach partition foo2;
alter table foo attach partition foo2 for values from (1000) to (2000);
^D
Window #1:
prepare foo as select * from foo where a = $1;
explain execute foo(1500);
\watch 0.01
Window #2:
$ pgbench -n -f x -T 60
Boom:
TRAP: FailedAssertion("!(partdesc->nparts == pinfo->nparts)", File:
"execPartition.c", Line: 1631)
I don't know how to reduce this to something reliable enough to
include it in the regression tests, and maybe we don't really need
that, but it's good to know that this is not a purely theoretical
problem. I think next I'll try to write some code to make
execPartition.c able to cope with the situation when it arises.
(My draft/WIP patches attached, if you're interested.)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
0008-Drop-lock-level.patch
From 18ddaabd3a6486a59c1720c121fa0eec8b03672e Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 17 Jan 2019 10:25:12 -0500
Subject: [PATCH 8/8] Drop lock level.
---
src/backend/commands/tablecmds.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index cf16e287af..fe78c76fa7 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -3657,7 +3657,7 @@ AlterTableGetLockLevel(List *cmds)
case AT_AttachPartition:
case AT_DetachPartition:
- cmd_lockmode = AccessExclusiveLock;
+ cmd_lockmode = ShareUpdateExclusiveLock;
break;
default: /* oops */
--
2.17.2 (Apple Git-113)
0006-Adapt-the-executor-to-use-a-PartitionDirectory.patch
From e0eddf372d883eed6dc67af548f6a22d4ae499d6 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 18 Dec 2018 14:18:44 -0500
Subject: [PATCH 6/8] Adapt the executor to use a PartitionDirectory.
---
src/backend/commands/copy.c | 2 +-
src/backend/executor/execPartition.c | 28 +++++++++++++++++++-------
src/backend/executor/nodeModifyTable.c | 2 +-
src/include/executor/execPartition.h | 3 ++-
src/include/nodes/execnodes.h | 2 ++
5 files changed, 27 insertions(+), 10 deletions(-)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 4411b19e58..e22c8dab91 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2561,7 +2561,7 @@ CopyFrom(CopyState cstate)
* CopyFrom tuple routing.
*/
if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
- proute = ExecSetupPartitionTupleRouting(NULL, cstate->rel);
+ proute = ExecSetupPartitionTupleRouting(estate, NULL, cstate->rel);
if (cstate->whereClause)
cstate->qualexpr = ExecInitQual(castNode(List, cstate->whereClause),
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 58666fcf26..9124d5e54d 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -167,7 +167,8 @@ static void ExecInitRoutingInfo(ModifyTableState *mtstate,
PartitionDispatch dispatch,
ResultRelInfo *partRelInfo,
int partidx);
-static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
+static PartitionDispatch ExecInitPartitionDispatchInfo(EState *estate,
+ PartitionTupleRouting *proute,
Oid partoid, PartitionDispatch parent_pd, int partidx);
static void FormPartitionKeyDatum(PartitionDispatch pd,
TupleTableSlot *slot,
@@ -204,7 +205,8 @@ static void find_matching_subplans_recurse(PartitionPruningData *prunedata,
* it should be estate->es_query_cxt.
*/
PartitionTupleRouting *
-ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
+ExecSetupPartitionTupleRouting(EState *estate, ModifyTableState *mtstate,
+ Relation rel)
{
PartitionTupleRouting *proute;
ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
@@ -229,7 +231,8 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
* parent as NULL as we don't need to care about any parent of the target
* partitioned table.
*/
- ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL, 0);
+ ExecInitPartitionDispatchInfo(estate, proute, RelationGetRelid(rel),
+ NULL, 0);
/*
* If performing an UPDATE with tuple routing, we can reuse partition
@@ -430,7 +433,8 @@ ExecFindPartition(ModifyTableState *mtstate,
* Create the new PartitionDispatch. We pass the current one
* in as the parent PartitionDispatch
*/
- subdispatch = ExecInitPartitionDispatchInfo(proute,
+ subdispatch = ExecInitPartitionDispatchInfo(mtstate->ps.state,
+ proute,
partdesc->oids[partidx],
dispatch, partidx);
Assert(dispatch->indexes[partidx] >= 0 &&
@@ -972,7 +976,8 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
* newly created PartitionDispatch later.
*/
static PartitionDispatch
-ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
+ExecInitPartitionDispatchInfo(EState *estate,
+ PartitionTupleRouting *proute, Oid partoid,
PartitionDispatch parent_pd, int partidx)
{
Relation rel;
@@ -981,13 +986,17 @@ ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
int dispatchidx;
MemoryContext oldcxt;
+ if (estate->es_partition_directory == NULL)
+ estate->es_partition_directory =
+ CreatePartitionDirectory(estate->es_query_cxt);
+
oldcxt = MemoryContextSwitchTo(proute->memcxt);
if (partoid != RelationGetRelid(proute->partition_root))
rel = table_open(partoid, NoLock);
else
rel = proute->partition_root;
- partdesc = RelationGetPartitionDesc(rel);
+ partdesc = PartitionDirectoryLookup(estate->es_partition_directory, rel);
pd = (PartitionDispatch) palloc(offsetof(PartitionDispatchData, indexes) +
partdesc->nparts * sizeof(int));
@@ -1533,6 +1542,10 @@ ExecCreatePartitionPruneState(PlanState *planstate,
ListCell *lc;
int i;
+ if (estate->es_partition_directory == NULL)
+ estate->es_partition_directory =
+ CreatePartitionDirectory(estate->es_query_cxt);
+
n_part_hierarchies = list_length(partitionpruneinfo->prune_infos);
Assert(n_part_hierarchies > 0);
@@ -1612,7 +1625,8 @@ ExecCreatePartitionPruneState(PlanState *planstate,
*/
partrel = ExecGetRangeTableRelation(estate, pinfo->rtindex);
partkey = RelationGetPartitionKey(partrel);
- partdesc = RelationGetPartitionDesc(partrel);
+ partdesc = PartitionDirectoryLookup(estate->es_partition_directory,
+ partrel);
n_steps = list_length(pinfo->pruning_steps);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 566858c19b..b9ecd8d24e 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -2229,7 +2229,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
(operation == CMD_INSERT || update_tuple_routing_needed))
mtstate->mt_partition_tuple_routing =
- ExecSetupPartitionTupleRouting(mtstate, rel);
+ ExecSetupPartitionTupleRouting(estate, mtstate, rel);
/*
* Build state for collecting transition tuples. This requires having a
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 2048c43c37..b363aba2a5 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -135,7 +135,8 @@ typedef struct PartitionPruneState
PartitionPruningData *partprunedata[FLEXIBLE_ARRAY_MEMBER];
} PartitionPruneState;
-extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
+extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(EState *estate,
+ ModifyTableState *mtstate,
Relation rel);
extern ResultRelInfo *ExecFindPartition(ModifyTableState *mtstate,
ResultRelInfo *rootResultRelInfo,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3b789ee7cf..84de8efeda 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -19,6 +19,7 @@
#include "lib/pairingheap.h"
#include "nodes/params.h"
#include "nodes/plannodes.h"
+#include "partitioning/partdefs.h"
#include "utils/hsearch.h"
#include "utils/queryenvironment.h"
#include "utils/reltrigger.h"
@@ -515,6 +516,7 @@ typedef struct EState
*/
ResultRelInfo *es_root_result_relations; /* array of ResultRelInfos */
int es_num_root_result_relations; /* length of the array */
+ PartitionDirectory es_partition_directory; /* for PartitionDesc lookup */
/*
* The following list contains ResultRelInfos created by the tuple routing
--
2.17.2 (Apple Git-113)
0005-Adapt-the-optimizer-to-use-a-PartitionDirectory.patch
From ee174e99c2a8a3382645b87f4e1d0098dd01f7f9 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 17 Jan 2019 10:04:15 -0500
Subject: [PATCH 5/8] Adapt the optimizer to use a PartitionDirectory.
Along the way, make expand_partitioned_rtentry responsible for
acquiring locks.
Hey, it's in the wrong order, but maybe I won't worry about that
for right now.
---
src/backend/optimizer/util/inherit.c | 68 +++++++++++++++-------------
src/backend/optimizer/util/plancat.c | 2 +-
src/include/nodes/relation.h | 3 ++
3 files changed, 40 insertions(+), 33 deletions(-)
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index faba493200..04a930d65b 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -124,28 +124,15 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
/*
* The rewriter should already have obtained an appropriate lock on each
- * relation named in the query. However, for each child relation we add
- * to the query, we must obtain an appropriate lock, because this will be
- * the first use of those relations in the parse/rewrite/plan pipeline.
- * Child rels should use the same lockmode as their parent.
+ * relation named in the query, so we can open the parent relation without
+ * locking it. However, for each child relation we add to the query, we
+ * must obtain an appropriate lock, because this will be the first use of
+ * those relations in the parse/rewrite/plan pipeline. Child rels should
+ * use the same lockmode as their parent.
*/
+ oldrelation = table_open(parentOID, NoLock);
lockmode = rte->rellockmode;
- /* Scan for all members of inheritance set, acquire needed locks */
- inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);
-
- /*
- * Check that there's at least one descendant, else treat as no-child
- * case. This could happen despite above has_subclass() check, if table
- * once had a child but no longer does.
- */
- if (list_length(inhOIDs) < 2)
- {
- /* Clear flag before returning */
- rte->inh = false;
- return;
- }
-
/*
* If parent relation is selected FOR UPDATE/SHARE, we need to mark its
* PlanRowMark as isParent = true, and generate a new PlanRowMark for each
@@ -155,21 +142,19 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
if (oldrc)
oldrc->isParent = true;
- /*
- * Must open the parent relation to examine its tupdesc. We need not lock
- * it; we assume the rewriter already did.
- */
- oldrelation = table_open(parentOID, NoLock);
-
/* Scan the inheritance set and expand it */
- if (RelationGetPartitionDesc(oldrelation) != NULL)
+ if (oldrelation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);
+ if (root->partition_directory == NULL)
+ root->partition_directory =
+ CreatePartitionDirectory(CurrentMemoryContext);
+
/*
- * If this table has partitions, recursively expand them in the order
- * in which they appear in the PartitionDesc. While at it, also
- * extract the partition key columns of all the partitioned tables.
+ * If this table has partitions, recursively expand and lock them.
+ * While at it, also extract the partition key columns of all the
+ * partitioned tables.
*/
expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
lockmode, &root->append_rel_list);
@@ -180,6 +165,22 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
RangeTblEntry *childrte;
Index childRTindex;
+ /* Scan for all members of inheritance set, acquire needed locks */
+ inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);
+
+ /*
+ * Check that there's at least one descendant, else treat as no-child
+ * case. This could happen despite above has_subclass() check, if the
+ * table once had a child but no longer does.
+ */
+ if (list_length(inhOIDs) < 2)
+ {
+ /* Clear flag before returning */
+ rte->inh = false;
+ table_close(oldrelation, NoLock);
+ return;
+ }
+
/*
* This table has no partitions. Expand any plain inheritance
* children in the order the OIDs were returned by
@@ -249,7 +250,10 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
int i;
RangeTblEntry *childrte;
Index childRTindex;
- PartitionDesc partdesc = RelationGetPartitionDesc(parentrel);
+ PartitionDesc partdesc;
+
+ partdesc = PartitionDirectoryLookup(root->partition_directory,
+ parentrel);
check_stack_depth();
@@ -289,8 +293,8 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
Oid childOID = partdesc->oids[i];
Relation childrel;
- /* Open rel; we already have required locks */
- childrel = table_open(childOID, NoLock);
+ /* Open rel, acquiring required locks */
+ childrel = table_open(childOID, lockmode);
/*
* Temporary partitions belonging to other sessions should have been
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 1bb1edd8a4..e9a8d99063 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -1904,7 +1904,7 @@ set_relation_partition_info(PlannerInfo *root, RelOptInfo *rel,
Assert(relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
- partdesc = RelationGetPartitionDesc(relation);
+ partdesc = PartitionDirectoryLookup(root->partition_directory, relation);
partkey = RelationGetPartitionKey(relation);
rel->part_scheme = find_partition_scheme(root, relation);
Assert(partdesc != NULL && rel->part_scheme != NULL);
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 420ca05c30..a78481906e 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -342,6 +342,9 @@ typedef struct PlannerInfo
/* Does this query modify any partition key columns? */
bool partColsUpdated;
+
+ /* Directory of partition descriptors. */
+ PartitionDirectory partition_directory;
} PlannerInfo;
--
2.17.2 (Apple Git-113)
Attachment: 0007-relid_map-crosschecks.patch (application/octet-stream)
From 79905c7b064e4c3b779492744cbb9109bf170a23 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 17 Jan 2019 09:11:10 -0500
Subject: [PATCH 7/8] relid_map crosschecks
---
src/backend/executor/execPartition.c | 4 ++++
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/partitioning/partprune.c | 7 ++++++-
src/include/nodes/plannodes.h | 1 +
6 files changed, 14 insertions(+), 1 deletion(-)
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 9124d5e54d..495d26c7ae 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1628,6 +1628,10 @@ ExecCreatePartitionPruneState(PlanState *planstate,
partdesc = PartitionDirectoryLookup(estate->es_partition_directory,
partrel);
+ Assert(partdesc->nparts == pinfo->nparts);
+ Assert(memcmp(partdesc->oids, pinfo->relid_map,
+ pinfo->nparts * sizeof(Oid)) == 0);
+
n_steps = list_length(pinfo->pruning_steps);
context->strategy = partkey->strategy;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 3eb7e95d64..7eb7925472 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -1197,6 +1197,7 @@ _copyPartitionedRelPruneInfo(const PartitionedRelPruneInfo *from)
COPY_SCALAR_FIELD(nexprs);
COPY_POINTER_FIELD(subplan_map, from->nparts * sizeof(int));
COPY_POINTER_FIELD(subpart_map, from->nparts * sizeof(int));
+ COPY_POINTER_FIELD(relid_map, from->nparts * sizeof(Oid));
COPY_POINTER_FIELD(hasexecparam, from->nexprs * sizeof(bool));
COPY_SCALAR_FIELD(do_initial_prune);
COPY_SCALAR_FIELD(do_exec_prune);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 33f7939e05..b31cae99bc 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -947,6 +947,7 @@ _outPartitionedRelPruneInfo(StringInfo str, const PartitionedRelPruneInfo *node)
WRITE_INT_FIELD(nexprs);
WRITE_INT_ARRAY(subplan_map, node->nparts);
WRITE_INT_ARRAY(subpart_map, node->nparts);
+ WRITE_OID_ARRAY(relid_map, node->nparts);
WRITE_BOOL_ARRAY(hasexecparam, node->nexprs);
WRITE_BOOL_FIELD(do_initial_prune);
WRITE_BOOL_FIELD(do_exec_prune);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 43491e297b..4433438fb6 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2384,6 +2384,7 @@ _readPartitionedRelPruneInfo(void)
READ_INT_FIELD(nexprs);
READ_INT_ARRAY(subplan_map, local_node->nparts);
READ_INT_ARRAY(subpart_map, local_node->nparts);
+ READ_OID_ARRAY(relid_map, local_node->nparts);
READ_BOOL_ARRAY(hasexecparam, local_node->nexprs);
READ_BOOL_FIELD(do_initial_prune);
READ_BOOL_FIELD(do_exec_prune);
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index 901433c68c..1b0cbe2c95 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -51,8 +51,9 @@
#include "optimizer/predtest.h"
#include "optimizer/prep.h"
#include "optimizer/var.h"
-#include "partitioning/partprune.h"
+#include "parser/parsetree.h"
#include "partitioning/partbounds.h"
+#include "partitioning/partprune.h"
#include "rewrite/rewriteManip.h"
#include "utils/lsyscache.h"
@@ -363,6 +364,7 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
int partnatts = subpart->part_scheme->partnatts;
int *subplan_map;
int *subpart_map;
+ Oid *relid_map;
List *partprunequal;
List *pruning_steps;
bool contradictory;
@@ -438,6 +440,7 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
*/
subplan_map = (int *) palloc(nparts * sizeof(int));
subpart_map = (int *) palloc(nparts * sizeof(int));
+ relid_map = (Oid *) palloc(nparts * sizeof(Oid));
present_parts = NULL;
for (i = 0; i < nparts; i++)
@@ -448,6 +451,7 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
subplan_map[i] = subplanidx;
subpart_map[i] = subpartidx;
+ relid_map[i] = planner_rt_fetch(partrel->relid, root)->relid;
if (subplanidx >= 0)
{
present_parts = bms_add_member(present_parts, i);
@@ -466,6 +470,7 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
pinfo->nparts = nparts;
pinfo->subplan_map = subplan_map;
pinfo->subpart_map = subpart_map;
+ pinfo->relid_map = relid_map;
/* Determine which pruning types should be enabled at this level */
doruntimeprune |= analyze_partkey_exprs(pinfo, pruning_steps,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 6d087c268f..d66a187a53 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1108,6 +1108,7 @@ typedef struct PartitionedRelPruneInfo
int nexprs; /* Length of hasexecparam[] */
int *subplan_map; /* subplan index by partition index, or -1 */
int *subpart_map; /* subpart index by partition index, or -1 */
+ Oid *relid_map; /* relation OID by partition index */
bool *hasexecparam; /* true if corresponding pruning_step contains
* any PARAM_EXEC Params. */
bool do_initial_prune; /* true if pruning should be performed
--
2.17.2 (Apple Git-113)
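The cross-check in 0007 boils down to: the executor's PartitionDesc, obtained through the directory, must agree element-for-element with the relid_map the planner serialized into the plan. Here is a minimal standalone sketch of that contract, not PostgreSQL code; the function name relid_maps_match is made up for illustration.

```c
#include <assert.h>
#include <string.h>

typedef unsigned int Oid;       /* stand-in for PostgreSQL's Oid */

/*
 * Compare the plan-time OID snapshot (pinfo->relid_map) against the
 * executor's view of the PartitionDesc (partdesc->oids).  The 0007
 * asserts simply insist these match; with a PartitionDirectory pinning
 * one descriptor per relation for the life of the query, they must.
 */
static int
relid_maps_match(const Oid *plan_oids, int plan_nparts,
				 const Oid *exec_oids, int exec_nparts)
{
	if (plan_nparts != exec_nparts)
		return 0;
	return memcmp(plan_oids, exec_oids, plan_nparts * sizeof(Oid)) == 0;
}
```

A mismatch here would indicate that a concurrently attached or detached partition leaked into execution without the plan being invalidated.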
Attachment: 0004-Postpone-old-context-removal.patch (application/octet-stream)
From 7bc8e2ba64b2803f67acec5c82e6bdd400ad3af1 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 28 Nov 2018 11:50:52 -0500
Subject: [PATCH 4/8] Postpone old context removal.
---
src/backend/utils/cache/relcache.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 98257c8057..a764a8fe04 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2482,6 +2482,26 @@ RelationClearRelation(Relation relation, bool rebuild)
SWAPFIELD(PartitionDesc, rd_partdesc);
SWAPFIELD(MemoryContext, rd_pdcxt);
}
+ else if (rebuild && newrel->rd_pdcxt != NULL)
+ {
+ /*
+ * We are rebuilding a partitioned relation with a non-zero
+ * reference count, so keep the old partition descriptor around,
+ * in case there's a PartitionDirectory with a pointer to it.
+ * Attach it to the new rd_pdcxt so that it gets cleaned up
+ * eventually. In the case where the reference count is 0, this
+ * code is not reached, which should be OK because in that case
+ * there should be no PartitionDirectory with a pointer to the old
+ * entry.
+ *
+ * Note that newrel and relation have already been swapped, so
+ * the "old" partition descriptor is actually the one hanging off
+ * of newrel.
+ */
+ MemoryContextSetParent(newrel->rd_pdcxt, relation->rd_pdcxt);
+ newrel->rd_partdesc = NULL;
+ newrel->rd_pdcxt = NULL;
+ }
#undef SWAPFIELD
--
2.17.2 (Apple Git-113)
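The trick in 0004 is lifetime management: instead of freeing the old rd_pdcxt (which a PartitionDirectory may still point into), it is reparented under the new rd_pdcxt, so the old descriptor is freed only when the new context eventually goes away. The following is a toy model of that parenting behavior with a hand-rolled context tree; ctx_create and ctx_delete are invented stand-ins, not the real MemoryContext API.

```c
#include <assert.h>

/*
 * Toy memory-context tree: each "context" has a parent and a live flag,
 * and deleting a context recursively deletes its descendants -- the
 * property MemoryContextSetParent relies on in the 0004 patch.
 */
#define MAX_CTX 8

typedef struct ToyCtx
{
	int			parent;			/* index of parent, or -1 for a root */
	int			live;			/* still allocated? */
} ToyCtx;

static ToyCtx ctxs[MAX_CTX];

/* Allocate a context slot under the given parent. */
static int
ctx_create(int parent)
{
	for (int i = 0; i < MAX_CTX; i++)
	{
		if (!ctxs[i].live)
		{
			ctxs[i].parent = parent;
			ctxs[i].live = 1;
			return i;
		}
	}
	return -1;					/* out of slots */
}

/* Delete a context and, recursively, everything parented under it. */
static void
ctx_delete(int ctx)
{
	for (int i = 0; i < MAX_CTX; i++)
	{
		if (ctxs[i].live && ctxs[i].parent == ctx)
			ctx_delete(i);
	}
	ctxs[ctx].live = 0;
}
```

Reparenting the old descriptor's context under the rebuilt one means no explicit bookkeeping is needed: whoever destroys the new context takes the old descriptor down with it.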
Attachment: 0003-Initial-cut-at-PartitionDirectory.patch (application/octet-stream)
From 588ce747026bfdd0da5d6c0a95c73082cb9316d7 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 28 Nov 2018 10:15:55 -0500
Subject: [PATCH 3/8] Initial cut at PartitionDirectory.
---
src/backend/partitioning/partdesc.c | 64 ++++++++++++++++++++++++++++-
src/include/partitioning/partdefs.h | 2 +
src/include/partitioning/partdesc.h | 3 ++
3 files changed, 68 insertions(+), 1 deletion(-)
diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c
index 66b1e38527..a207ff35ee 100644
--- a/src/backend/partitioning/partdesc.c
+++ b/src/backend/partitioning/partdesc.c
@@ -21,12 +21,25 @@
#include "storage/sinval.h"
#include "utils/builtins.h"
#include "utils/inval.h"
+#include "utils/hsearch.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
#include "utils/partcache.h"
#include "utils/syscache.h"
+typedef struct PartitionDirectoryData
+{
+ MemoryContext pdir_mcxt;
+ HTAB *pdir_hash;
+} PartitionDirectoryData;
+
+typedef struct PartitionDirectoryEntry
+{
+ Oid reloid;
+ PartitionDesc pd;
+} PartitionDirectoryEntry;
+
/*
* RelationBuildPartitionDesc
* Form rel's partition descriptor
@@ -208,13 +221,62 @@ RelationBuildPartitionDesc(Relation rel)
partdesc->oids[index] = oids[i];
/* Record if the partition is a leaf partition */
partdesc->is_leaf[index] =
- (get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
+ (get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
}
MemoryContextSwitchTo(oldcxt);
rel->rd_partdesc = partdesc;
}
+/*
+ * CreatePartitionDirectory
+ * Create a new partition directory object.
+ */
+PartitionDirectory
+CreatePartitionDirectory(MemoryContext mcxt)
+{
+ MemoryContext oldcontext = MemoryContextSwitchTo(mcxt);
+ PartitionDirectory pdir;
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(HASHCTL));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(PartitionDirectoryEntry);
+ ctl.hcxt = mcxt;
+
+ pdir = palloc(sizeof(PartitionDirectoryData));
+ pdir->pdir_mcxt = mcxt;
+ pdir->pdir_hash = hash_create("partition directory", 256, &ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+ MemoryContextSwitchTo(oldcontext);
+ return pdir;
+}
+
+/*
+ * PartitionDirectoryLookup
+ * Look up the partition descriptor for a relation in the directory.
+ *
+ * The purpose of this function is to ensure that we get the same
+ * PartitionDesc for each relation every time we look it up. In the
+ * face of concurrent DDL, different PartitionDescs may be constructed with
+ * different views of the catalog state, but any single particular OID
+ * will always get the same PartitionDesc for as long as the same
+ * PartitionDirectory is used.
+ */
+PartitionDesc
+PartitionDirectoryLookup(PartitionDirectory pdir, Relation rel)
+{
+ PartitionDirectoryEntry *pde;
+ Oid relid = RelationGetRelid(rel);
+ bool found;
+
+ pde = hash_search(pdir->pdir_hash, &relid, HASH_ENTER, &found);
+ if (!found)
+ pde->pd = RelationGetPartitionDesc(rel);
+ return pde->pd;
+}
+
/*
* equalPartitionDescs
* Compare two partition descriptors for logical equality
diff --git a/src/include/partitioning/partdefs.h b/src/include/partitioning/partdefs.h
index 6e9c128b2c..aec3b3fe63 100644
--- a/src/include/partitioning/partdefs.h
+++ b/src/include/partitioning/partdefs.h
@@ -21,4 +21,6 @@ typedef struct PartitionBoundSpec PartitionBoundSpec;
typedef struct PartitionDescData *PartitionDesc;
+typedef struct PartitionDirectoryData *PartitionDirectory;
+
#endif /* PARTDEFS_H */
diff --git a/src/include/partitioning/partdesc.h b/src/include/partitioning/partdesc.h
index f72b70dded..6e384541da 100644
--- a/src/include/partitioning/partdesc.h
+++ b/src/include/partitioning/partdesc.h
@@ -31,6 +31,9 @@ typedef struct PartitionDescData
extern void RelationBuildPartitionDesc(Relation rel);
+extern PartitionDirectory CreatePartitionDirectory(MemoryContext mcxt);
+extern PartitionDesc PartitionDirectoryLookup(PartitionDirectory, Relation);
+
extern Oid get_default_oid_from_partdesc(PartitionDesc partdesc);
extern bool equalPartitionDescs(PartitionKey key, PartitionDesc partdesc1,
--
2.17.2 (Apple Git-113)
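The contract PartitionDirectoryLookup provides is memoization: the first lookup for an OID builds and caches a descriptor, and every later lookup in the same directory returns that same pointer, even if concurrent DDL has since changed the catalog. A minimal standalone sketch of that contract follows; it uses a fixed-size open-addressed table instead of dynahash and a fake catalog counter, and pdir_lookup/build_partdesc here are illustrative stand-ins, not the patched functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef unsigned int Oid;		/* stand-in for PostgreSQL's Oid */

typedef struct ToyPartDesc
{
	int			nparts;			/* all we model of a PartitionDesc */
} ToyPartDesc;

/* Fixed-size map from relation OID to the first descriptor built for it. */
#define PDIR_SLOTS 64

typedef struct ToyPartitionDirectory
{
	Oid			oids[PDIR_SLOTS];	/* 0 means "slot empty" */
	ToyPartDesc *descs[PDIR_SLOTS];
} ToyPartitionDirectory;

/* Pretend catalog state that concurrent ATTACH/DETACH may bump. */
static int	catalog_nparts = 3;

/* Stand-in for RelationGetPartitionDesc: reflects the catalog right now. */
static ToyPartDesc *
build_partdesc(Oid relid)
{
	ToyPartDesc *pd = malloc(sizeof(ToyPartDesc));

	(void) relid;
	pd->nparts = catalog_nparts;
	return pd;
}

/*
 * Memoizing lookup: build and cache on first use; afterwards, always
 * hand back the cached descriptor, ignoring later catalog changes.
 */
static ToyPartDesc *
pdir_lookup(ToyPartitionDirectory *pdir, Oid relid)
{
	unsigned	h = relid % PDIR_SLOTS;

	while (pdir->oids[h] != 0 && pdir->oids[h] != relid)
		h = (h + 1) % PDIR_SLOTS;
	if (pdir->oids[h] == 0)
	{
		pdir->oids[h] = relid;
		pdir->descs[h] = build_partdesc(relid);
	}
	return pdir->descs[h];
}
```

This is why 0005 threads one directory through the whole planner invocation: every site that consults the PartitionDesc for a given relation is guaranteed to see the same snapshot.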
Attachment: 0002-Ensure-that-RelationBuildPartitionDesc-sees-a-consis.patch (application/octet-stream)
From 3faefaac13193aa3dafc2c04b141837305300e24 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Nov 2018 12:15:44 -0500
Subject: [PATCH 2/8] Ensure that RelationBuildPartitionDesc sees a consistent
view.
If partitions are added or removed concurrently, make sure that we
nevertheless get a view of the partition list and the partition
descriptor for each partition which is consistent with the system
state at some single point in the commit history.
To do this, reuse an idea first invented by Noah Misch back in
commit 4240e429d0c2d889d0cda23c618f94e12c13ade7.
---
src/backend/partitioning/partdesc.c | 137 ++++++++++++++++++++--------
1 file changed, 101 insertions(+), 36 deletions(-)
diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c
index 8a4b63aa26..66b1e38527 100644
--- a/src/backend/partitioning/partdesc.c
+++ b/src/backend/partitioning/partdesc.c
@@ -18,7 +18,9 @@
#include "catalog/pg_inherits.h"
#include "partitioning/partbounds.h"
#include "partitioning/partdesc.h"
+#include "storage/sinval.h"
#include "utils/builtins.h"
+#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -47,45 +49,113 @@ RelationBuildPartitionDesc(Relation rel)
MemoryContext oldcxt;
int *mapping;
- /* Get partition oids from pg_inherits */
- inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
- nparts = list_length(inhoids);
-
- if (nparts > 0)
+ /*
+ * Fetch catalog information. Since we want to allow partitions to be
+ * added and removed without holding AccessExclusiveLock on the parent
+ * table, it's possible that the catalog contents could be changing under
+ * us. That means that by the time we fetch the partition bound for a
+ * partition returned by find_inheritance_children, it might no longer be
+ * a partition or might even be a partition of some other table.
+ *
+ * To ensure that we get a consistent view of the catalog data, we first
+ * fetch everything we need and then call AcceptInvalidationMessages. If
+ * SharedInvalidMessageCounter advances between the time we start fetching
+ * information and the time AcceptInvalidationMessages() completes, that
+ * means something may have changed under us, so we start over and do it
+ * all again.
+ */
+ for (;;)
{
- oids = palloc(nparts * sizeof(Oid));
- boundspecs = palloc(nparts * sizeof(PartitionBoundSpec *));
+ uint64 inval_count = SharedInvalidMessageCounter;
+
+ /* Get partition oids from pg_inherits */
+ inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
+ nparts = list_length(inhoids);
+
+ if (nparts > 0)
+ {
+ oids = palloc(nparts * sizeof(Oid));
+ boundspecs = palloc(nparts * sizeof(PartitionBoundSpec *));
+ }
+
+ /* Collect bound spec nodes for each partition */
+ i = 0;
+ foreach(cell, inhoids)
+ {
+ Oid inhrelid = lfirst_oid(cell);
+ HeapTuple tuple;
+ PartitionBoundSpec *boundspec = NULL;
+
+ /*
+ * Don't put any sanity checks here that might fail as a result of
+ * concurrent DDL, such as a check that relpartbound is not NULL.
+ * We could transiently see such states as a result of concurrent
+ * DDL. Such checks can be performed only after we're sure we got
+ * a consistent view of the underlying data.
+ */
+ tuple = SearchSysCache1(RELOID, inhrelid);
+ if (HeapTupleIsValid(tuple))
+ {
+ Datum datum;
+ bool isnull;
+
+ datum = SysCacheGetAttr(RELOID, tuple,
+ Anum_pg_class_relpartbound,
+ &isnull);
+ if (!isnull)
+ boundspec = stringToNode(TextDatumGetCString(datum));
+ ReleaseSysCache(tuple);
+ }
+
+ oids[i] = inhrelid;
+ boundspecs[i] = boundspec;
+ ++i;
+ }
+
+ /*
+ * If no relevant catalog changes have occurred (see comments at the
+ * top of this loop), then we got a consistent view of our partition
+ * list and can stop now.
+ */
+ AcceptInvalidationMessages();
+ if (inval_count == SharedInvalidMessageCounter)
+ break;
+
+ /* Something changed, so retry from the top. */
+ if (oids != NULL)
+ {
+ pfree(oids);
+ oids = NULL;
+ }
+ if (boundspecs != NULL)
+ {
+ pfree(boundspecs);
+ boundspecs = NULL;
+ }
+ if (inhoids != NIL)
+ list_free(inhoids);
}
- /* Collect bound spec nodes for each partition */
- i = 0;
- foreach(cell, inhoids)
+ /*
+ * At this point, we should have a consistent view of the data we got from
+ * pg_inherits and pg_class, so it's safe to perform some sanity checks.
+ */
+ for (i = 0; i < nparts; ++i)
{
- Oid inhrelid = lfirst_oid(cell);
- HeapTuple tuple;
- Datum datum;
- bool isnull;
- PartitionBoundSpec *boundspec;
-
- tuple = SearchSysCache1(RELOID, inhrelid);
- if (!HeapTupleIsValid(tuple))
- elog(ERROR, "cache lookup failed for relation %u", inhrelid);
-
- datum = SysCacheGetAttr(RELOID, tuple,
- Anum_pg_class_relpartbound,
- &isnull);
- if (isnull)
- elog(ERROR, "null relpartbound for relation %u", inhrelid);
- boundspec = stringToNode(TextDatumGetCString(datum));
- if (!IsA(boundspec, PartitionBoundSpec))
+ Oid inhrelid = oids[i];
+ PartitionBoundSpec *spec = boundspecs[i];
+
+ if (!spec)
+ elog(ERROR, "missing relpartbound for relation %u", inhrelid);
+ if (!IsA(spec, PartitionBoundSpec))
elog(ERROR, "invalid relpartbound for relation %u", inhrelid);
/*
- * Sanity check: If the PartitionBoundSpec says this is the default
- * partition, its OID should correspond to whatever's stored in
- * pg_partitioned_table.partdefid; if not, the catalog is corrupt.
+ * If the PartitionBoundSpec says this is the default partition, its
+ * OID should match pg_partitioned_table.partdefid; if not, the
+ * catalog is corrupt.
*/
- if (boundspec->is_default)
+ if (spec->is_default)
{
Oid partdefid;
@@ -94,11 +164,6 @@ RelationBuildPartitionDesc(Relation rel)
elog(ERROR, "expected partdefid %u, but got %u",
inhrelid, partdefid);
}
-
- oids[i] = inhrelid;
- boundspecs[i] = boundspec;
- ++i;
- ReleaseSysCache(tuple);
}
/* Now build the actual relcache partition descriptor */
--
2.17.2 (Apple Git-113)
Attachment: 0001-Move-code-for-managing-PartitionDescs-into-a-new-fil.patch (application/octet-stream)
From 290e2bc15a6deac07e386d1e39045dc0138e4bf8 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 26 Nov 2018 14:31:53 -0500
Subject: [PATCH 1/8] Move code for managing PartitionDescs into a new file,
partdesc.c
---
src/backend/catalog/heap.c | 1 +
src/backend/catalog/partition.c | 16 --
src/backend/catalog/pg_constraint.c | 2 +-
src/backend/commands/indexcmds.c | 2 +-
src/backend/commands/tablecmds.c | 1 +
src/backend/commands/trigger.c | 1 +
src/backend/executor/execPartition.c | 1 +
src/backend/optimizer/util/inherit.c | 1 +
src/backend/optimizer/util/plancat.c | 2 +-
src/backend/partitioning/Makefile | 2 +-
src/backend/partitioning/partbounds.c | 6 +-
src/backend/partitioning/partdesc.c | 221 ++++++++++++++++++++++++++
src/backend/utils/cache/partcache.c | 124 ---------------
src/backend/utils/cache/relcache.c | 57 +------
src/include/catalog/partition.h | 15 --
src/include/partitioning/partdesc.h | 39 +++++
src/include/utils/partcache.h | 1 -
17 files changed, 274 insertions(+), 218 deletions(-)
create mode 100644 src/backend/partitioning/partdesc.c
create mode 100644 src/include/partitioning/partdesc.h
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index cc865de627..fec03fc31e 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -71,6 +71,7 @@
#include "parser/parse_collate.h"
#include "parser/parse_expr.h"
#include "parser/parse_relation.h"
+#include "partitioning/partdesc.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "storage/smgr.h"
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 62d1ec60ba..9c5f7fe352 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -255,22 +255,6 @@ has_partition_attrs(Relation rel, Bitmapset *attnums, bool *used_in_expr)
return false;
}
-/*
- * get_default_oid_from_partdesc
- *
- * Given a partition descriptor, return the OID of the default partition, if
- * one exists; else, return InvalidOid.
- */
-Oid
-get_default_oid_from_partdesc(PartitionDesc partdesc)
-{
- if (partdesc && partdesc->boundinfo &&
- partition_bound_has_default(partdesc->boundinfo))
- return partdesc->oids[partdesc->boundinfo->default_index];
-
- return InvalidOid;
-}
-
/*
* get_default_partition_oid
*
diff --git a/src/backend/catalog/pg_constraint.c b/src/backend/catalog/pg_constraint.c
index 698b493fc4..ea8817e8c8 100644
--- a/src/backend/catalog/pg_constraint.c
+++ b/src/backend/catalog/pg_constraint.c
@@ -24,12 +24,12 @@
#include "catalog/dependency.h"
#include "catalog/indexing.h"
#include "catalog/objectaccess.h"
-#include "catalog/partition.h"
#include "catalog/pg_constraint.h"
#include "catalog/pg_operator.h"
#include "catalog/pg_type.h"
#include "commands/defrem.h"
#include "commands/tablecmds.h"
+#include "partitioning/partdesc.h"
#include "utils/array.h"
#include "utils/builtins.h"
#include "utils/fmgroids.h"
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 5b2b8d2969..ba0fcc1156 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -24,7 +24,6 @@
#include "catalog/catalog.h"
#include "catalog/index.h"
#include "catalog/indexing.h"
-#include "catalog/partition.h"
#include "catalog/pg_am.h"
#include "catalog/pg_constraint.h"
#include "catalog/pg_inherits.h"
@@ -48,6 +47,7 @@
#include "parser/parse_coerce.h"
#include "parser/parse_func.h"
#include "parser/parse_oper.h"
+#include "partitioning/partdesc.h"
#include "rewrite/rewriteManip.h"
#include "storage/lmgr.h"
#include "storage/proc.h"
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index ff76499137..cf16e287af 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -78,6 +78,7 @@
#include "parser/parse_utilcmd.h"
#include "parser/parser.h"
#include "partitioning/partbounds.h"
+#include "partitioning/partdesc.h"
#include "pgstat.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rewriteHandler.h"
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 499030c445..a3b9c96086 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -43,6 +43,7 @@
#include "parser/parse_func.h"
#include "parser/parse_relation.h"
#include "parser/parsetree.h"
+#include "partitioning/partdesc.h"
#include "pgstat.h"
#include "rewrite/rewriteManip.h"
#include "storage/bufmgr.h"
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 2a7bc01563..58666fcf26 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -24,6 +24,7 @@
#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "partitioning/partbounds.h"
+#include "partitioning/partdesc.h"
#include "partitioning/partprune.h"
#include "rewrite/rewriteManip.h"
#include "utils/lsyscache.h"
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index eaf788e578..faba493200 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -23,6 +23,7 @@
#include "optimizer/inherit.h"
#include "optimizer/planner.h"
#include "optimizer/prep.h"
+#include "partitioning/partdesc.h"
#include "utils/rel.h"
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 243344a011..1bb1edd8a4 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -27,7 +27,6 @@
#include "catalog/catalog.h"
#include "catalog/dependency.h"
#include "catalog/heap.h"
-#include "catalog/partition.h"
#include "catalog/pg_am.h"
#include "catalog/pg_statistic_ext.h"
#include "foreign/fdwapi.h"
@@ -39,6 +38,7 @@
#include "optimizer/predtest.h"
#include "optimizer/prep.h"
#include "partitioning/partbounds.h"
+#include "partitioning/partdesc.h"
#include "parser/parse_relation.h"
#include "parser/parsetree.h"
#include "rewrite/rewriteManip.h"
diff --git a/src/backend/partitioning/Makefile b/src/backend/partitioning/Makefile
index 278fac3afa..82093c615f 100644
--- a/src/backend/partitioning/Makefile
+++ b/src/backend/partitioning/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/partitioning
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = partprune.o partbounds.o
+OBJS = partbounds.o partdesc.o partprune.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index f21c9b32a6..681bd60c77 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -10,7 +10,8 @@
* src/backend/partitioning/partbounds.c
*
*-------------------------------------------------------------------------
-*/
+ */
+
#include "postgres.h"
#include "access/heapam.h"
@@ -24,8 +25,9 @@
#include "nodes/nodeFuncs.h"
#include "optimizer/clauses.h"
#include "parser/parse_coerce.h"
-#include "partitioning/partprune.h"
#include "partitioning/partbounds.h"
+#include "partitioning/partdesc.h"
+#include "partitioning/partprune.h"
#include "utils/builtins.h"
#include "utils/datum.h"
#include "utils/fmgroids.h"
diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c
new file mode 100644
index 0000000000..8a4b63aa26
--- /dev/null
+++ b/src/backend/partitioning/partdesc.c
@@ -0,0 +1,221 @@
+/*-------------------------------------------------------------------------
+ *
+ * partdesc.c
+ * Support routines for manipulating partition descriptors
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/partitioning/partdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "catalog/partition.h"
+#include "catalog/pg_inherits.h"
+#include "partitioning/partbounds.h"
+#include "partitioning/partdesc.h"
+#include "utils/builtins.h"
+#include "utils/lsyscache.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+#include "utils/partcache.h"
+#include "utils/syscache.h"
+
+/*
+ * RelationBuildPartitionDesc
+ * Form rel's partition descriptor
+ *
+ * Not flushed from the cache by RelationClearRelation() unless changed because
+ * of addition or removal of partition.
+ */
+void
+RelationBuildPartitionDesc(Relation rel)
+{
+ PartitionDesc partdesc;
+ PartitionBoundInfo boundinfo = NULL;
+ List *inhoids;
+ PartitionBoundSpec **boundspecs = NULL;
+ Oid *oids = NULL;
+ ListCell *cell;
+ int i,
+ nparts;
+ PartitionKey key = RelationGetPartitionKey(rel);
+ MemoryContext oldcxt;
+ int *mapping;
+
+ /* Get partition oids from pg_inherits */
+ inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
+ nparts = list_length(inhoids);
+
+ if (nparts > 0)
+ {
+ oids = palloc(nparts * sizeof(Oid));
+ boundspecs = palloc(nparts * sizeof(PartitionBoundSpec *));
+ }
+
+ /* Collect bound spec nodes for each partition */
+ i = 0;
+ foreach(cell, inhoids)
+ {
+ Oid inhrelid = lfirst_oid(cell);
+ HeapTuple tuple;
+ Datum datum;
+ bool isnull;
+ PartitionBoundSpec *boundspec;
+
+ tuple = SearchSysCache1(RELOID, inhrelid);
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", inhrelid);
+
+ datum = SysCacheGetAttr(RELOID, tuple,
+ Anum_pg_class_relpartbound,
+ &isnull);
+ if (isnull)
+ elog(ERROR, "null relpartbound for relation %u", inhrelid);
+ boundspec = stringToNode(TextDatumGetCString(datum));
+ if (!IsA(boundspec, PartitionBoundSpec))
+ elog(ERROR, "invalid relpartbound for relation %u", inhrelid);
+
+ /*
+ * Sanity check: If the PartitionBoundSpec says this is the default
+ * partition, its OID should correspond to whatever's stored in
+ * pg_partitioned_table.partdefid; if not, the catalog is corrupt.
+ */
+ if (boundspec->is_default)
+ {
+ Oid partdefid;
+
+ partdefid = get_default_partition_oid(RelationGetRelid(rel));
+ if (partdefid != inhrelid)
+ elog(ERROR, "expected partdefid %u, but got %u",
+ inhrelid, partdefid);
+ }
+
+ oids[i] = inhrelid;
+ boundspecs[i] = boundspec;
+ ++i;
+ ReleaseSysCache(tuple);
+ }
+
+ /* Now build the actual relcache partition descriptor */
+ rel->rd_pdcxt = AllocSetContextCreate(CacheMemoryContext,
+ "partition descriptor",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextCopyAndSetIdentifier(rel->rd_pdcxt,
+ RelationGetRelationName(rel));
+
+ oldcxt = MemoryContextSwitchTo(rel->rd_pdcxt);
+ partdesc = (PartitionDescData *) palloc0(sizeof(PartitionDescData));
+ partdesc->nparts = nparts;
+ /* oids and boundinfo are allocated below. */
+
+ MemoryContextSwitchTo(oldcxt);
+
+ if (nparts == 0)
+ {
+ rel->rd_partdesc = partdesc;
+ return;
+ }
+
+ /* First create PartitionBoundInfo */
+ boundinfo = partition_bounds_create(boundspecs, nparts, key, &mapping);
+
+ /* Now copy boundinfo and oids into partdesc. */
+ oldcxt = MemoryContextSwitchTo(rel->rd_pdcxt);
+ partdesc->boundinfo = partition_bounds_copy(boundinfo, key);
+ partdesc->oids = (Oid *) palloc(partdesc->nparts * sizeof(Oid));
+ partdesc->is_leaf = (bool *) palloc(partdesc->nparts * sizeof(bool));
+
+ /*
+ * Now assign OIDs from the original array into mapped indexes of the
+ * result array. The order of OIDs in the former is defined by the
+ * catalog scan that retrieved them, whereas that in the latter is defined
+ * by canonicalized representation of the partition bounds.
+ */
+ for (i = 0; i < partdesc->nparts; i++)
+ {
+ int index = mapping[i];
+
+ partdesc->oids[index] = oids[i];
+ /* Record if the partition is a leaf partition */
+ partdesc->is_leaf[index] =
+ (get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
+ }
+ MemoryContextSwitchTo(oldcxt);
+
+ rel->rd_partdesc = partdesc;
+}
+
+/*
+ * equalPartitionDescs
+ * Compare two partition descriptors for logical equality
+ */
+bool
+equalPartitionDescs(PartitionKey key, PartitionDesc partdesc1,
+ PartitionDesc partdesc2)
+{
+ int i;
+
+ if (partdesc1 != NULL)
+ {
+ if (partdesc2 == NULL)
+ return false;
+ if (partdesc1->nparts != partdesc2->nparts)
+ return false;
+
+ Assert(key != NULL || partdesc1->nparts == 0);
+
+ /*
+ * Same oids? If the partitioning structure did not change, that is,
+ * no partitions were added to or removed from the relation, the oids
+ * array
+ * should still match element-by-element.
+ */
+ for (i = 0; i < partdesc1->nparts; i++)
+ {
+ if (partdesc1->oids[i] != partdesc2->oids[i])
+ return false;
+ }
+
+ /*
+ * Now compare partition bound collections. The logic to iterate over
+ * the collections is private to partition.c.
+ */
+ if (partdesc1->boundinfo != NULL)
+ {
+ if (partdesc2->boundinfo == NULL)
+ return false;
+
+ if (!partition_bounds_equal(key->partnatts, key->parttyplen,
+ key->parttypbyval,
+ partdesc1->boundinfo,
+ partdesc2->boundinfo))
+ return false;
+ }
+ else if (partdesc2->boundinfo != NULL)
+ return false;
+ }
+ else if (partdesc2 != NULL)
+ return false;
+
+ return true;
+}
+
+/*
+ * get_default_oid_from_partdesc
+ *
+ * Given a partition descriptor, return the OID of the default partition, if
+ * one exists; else, return InvalidOid.
+ */
+Oid
+get_default_oid_from_partdesc(PartitionDesc partdesc)
+{
+ if (partdesc && partdesc->boundinfo &&
+ partition_bound_has_default(partdesc->boundinfo))
+ return partdesc->oids[partdesc->boundinfo->default_index];
+
+ return InvalidOid;
+}
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 2404073bc8..c0a9c21483 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -244,130 +244,6 @@ RelationBuildPartitionKey(Relation relation)
relation->rd_partkey = key;
}
-/*
- * RelationBuildPartitionDesc
- * Form rel's partition descriptor
- *
- * Not flushed from the cache by RelationClearRelation() unless changed because
- * of addition or removal of partition.
- */
-void
-RelationBuildPartitionDesc(Relation rel)
-{
- PartitionDesc partdesc;
- PartitionBoundInfo boundinfo = NULL;
- List *inhoids;
- PartitionBoundSpec **boundspecs = NULL;
- Oid *oids = NULL;
- ListCell *cell;
- int i,
- nparts;
- PartitionKey key = RelationGetPartitionKey(rel);
- MemoryContext oldcxt;
- int *mapping;
-
- /* Get partition oids from pg_inherits */
- inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
- nparts = list_length(inhoids);
-
- if (nparts > 0)
- {
- oids = palloc(nparts * sizeof(Oid));
- boundspecs = palloc(nparts * sizeof(PartitionBoundSpec *));
- }
-
- /* Collect bound spec nodes for each partition */
- i = 0;
- foreach(cell, inhoids)
- {
- Oid inhrelid = lfirst_oid(cell);
- HeapTuple tuple;
- Datum datum;
- bool isnull;
- PartitionBoundSpec *boundspec;
-
- tuple = SearchSysCache1(RELOID, inhrelid);
- if (!HeapTupleIsValid(tuple))
- elog(ERROR, "cache lookup failed for relation %u", inhrelid);
-
- datum = SysCacheGetAttr(RELOID, tuple,
- Anum_pg_class_relpartbound,
- &isnull);
- if (isnull)
- elog(ERROR, "null relpartbound for relation %u", inhrelid);
- boundspec = stringToNode(TextDatumGetCString(datum));
- if (!IsA(boundspec, PartitionBoundSpec))
- elog(ERROR, "invalid relpartbound for relation %u", inhrelid);
-
- /*
- * Sanity check: If the PartitionBoundSpec says this is the default
- * partition, its OID should correspond to whatever's stored in
- * pg_partitioned_table.partdefid; if not, the catalog is corrupt.
- */
- if (boundspec->is_default)
- {
- Oid partdefid;
-
- partdefid = get_default_partition_oid(RelationGetRelid(rel));
- if (partdefid != inhrelid)
- elog(ERROR, "expected partdefid %u, but got %u",
- inhrelid, partdefid);
- }
-
- oids[i] = inhrelid;
- boundspecs[i] = boundspec;
- ++i;
- ReleaseSysCache(tuple);
- }
-
- /* Now build the actual relcache partition descriptor */
- rel->rd_pdcxt = AllocSetContextCreate(CacheMemoryContext,
- "partition descriptor",
- ALLOCSET_DEFAULT_SIZES);
- MemoryContextCopyAndSetIdentifier(rel->rd_pdcxt, RelationGetRelationName(rel));
-
- oldcxt = MemoryContextSwitchTo(rel->rd_pdcxt);
- partdesc = (PartitionDescData *) palloc0(sizeof(PartitionDescData));
- partdesc->nparts = nparts;
- /* oids and boundinfo are allocated below. */
-
- MemoryContextSwitchTo(oldcxt);
-
- if (nparts == 0)
- {
- rel->rd_partdesc = partdesc;
- return;
- }
-
- /* First create PartitionBoundInfo */
- boundinfo = partition_bounds_create(boundspecs, nparts, key, &mapping);
-
- /* Now copy boundinfo and oids into partdesc. */
- oldcxt = MemoryContextSwitchTo(rel->rd_pdcxt);
- partdesc->boundinfo = partition_bounds_copy(boundinfo, key);
- partdesc->oids = (Oid *) palloc(partdesc->nparts * sizeof(Oid));
- partdesc->is_leaf = (bool *) palloc(partdesc->nparts * sizeof(bool));
-
- /*
- * Now assign OIDs from the original array into mapped indexes of the
- * result array. The order of OIDs in the former is defined by the
- * catalog scan that retrieved them, whereas that in the latter is defined
- * by canonicalized representation of the partition bounds.
- */
- for (i = 0; i < partdesc->nparts; i++)
- {
- int index = mapping[i];
-
- partdesc->oids[index] = oids[i];
- /* Record if the partition is a leaf partition */
- partdesc->is_leaf[index] =
- (get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
- }
- MemoryContextSwitchTo(oldcxt);
-
- rel->rd_partdesc = partdesc;
-}
-
/*
* RelationGetPartitionQual
*
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index bcf4f104cf..98257c8057 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -74,6 +74,7 @@
#include "optimizer/prep.h"
#include "optimizer/var.h"
#include "partitioning/partbounds.h"
+#include "partitioning/partdesc.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rowsecurity.h"
#include "storage/lmgr.h"
@@ -285,8 +286,6 @@ static OpClassCacheEnt *LookupOpclassInfo(Oid operatorClassOid,
StrategyNumber numSupport);
static void RelationCacheInitFileRemoveInDir(const char *tblspcpath);
static void unlink_initfile(const char *initfilename, int elevel);
-static bool equalPartitionDescs(PartitionKey key, PartitionDesc partdesc1,
- PartitionDesc partdesc2);
/*
@@ -997,60 +996,6 @@ equalRSDesc(RowSecurityDesc *rsdesc1, RowSecurityDesc *rsdesc2)
return true;
}
-/*
- * equalPartitionDescs
- * Compare two partition descriptors for logical equality
- */
-static bool
-equalPartitionDescs(PartitionKey key, PartitionDesc partdesc1,
- PartitionDesc partdesc2)
-{
- int i;
-
- if (partdesc1 != NULL)
- {
- if (partdesc2 == NULL)
- return false;
- if (partdesc1->nparts != partdesc2->nparts)
- return false;
-
- Assert(key != NULL || partdesc1->nparts == 0);
-
- /*
- * Same oids? If the partitioning structure did not change, that is,
- * no partitions were added or removed to the relation, the oids array
- * should still match element-by-element.
- */
- for (i = 0; i < partdesc1->nparts; i++)
- {
- if (partdesc1->oids[i] != partdesc2->oids[i])
- return false;
- }
-
- /*
- * Now compare partition bound collections. The logic to iterate over
- * the collections is private to partition.c.
- */
- if (partdesc1->boundinfo != NULL)
- {
- if (partdesc2->boundinfo == NULL)
- return false;
-
- if (!partition_bounds_equal(key->partnatts, key->parttyplen,
- key->parttypbyval,
- partdesc1->boundinfo,
- partdesc2->boundinfo))
- return false;
- }
- else if (partdesc2->boundinfo != NULL)
- return false;
- }
- else if (partdesc2 != NULL)
- return false;
-
- return true;
-}
-
/*
* RelationBuildDesc
*
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 5685d2fd57..d84e325983 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -19,20 +19,6 @@
/* Seed for the extended hash function */
#define HASH_PARTITION_SEED UINT64CONST(0x7A5B22367996DCFD)
-/*
- * Information about partitions of a partitioned table.
- */
-typedef struct PartitionDescData
-{
- int nparts; /* Number of partitions */
- Oid *oids; /* Array of 'nparts' elements containing
- * partition OIDs in order of the their bounds */
- bool *is_leaf; /* Array of 'nparts' elements storing whether
- * the corresponding 'oids' element belongs to
- * a leaf partition or not */
- PartitionBoundInfo boundinfo; /* collection of partition bounds */
-} PartitionDescData;
-
extern Oid get_partition_parent(Oid relid);
extern List *get_partition_ancestors(Oid relid);
extern List *map_partition_varattnos(List *expr, int fromrel_varno,
@@ -41,7 +27,6 @@ extern List *map_partition_varattnos(List *expr, int fromrel_varno,
extern bool has_partition_attrs(Relation rel, Bitmapset *attnums,
bool *used_in_expr);
-extern Oid get_default_oid_from_partdesc(PartitionDesc partdesc);
extern Oid get_default_partition_oid(Oid parentId);
extern void update_default_partition_oid(Oid parentId, Oid defaultPartId);
extern List *get_proposed_default_constraint(List *new_part_constaints);
diff --git a/src/include/partitioning/partdesc.h b/src/include/partitioning/partdesc.h
new file mode 100644
index 0000000000..f72b70dded
--- /dev/null
+++ b/src/include/partitioning/partdesc.h
@@ -0,0 +1,39 @@
+/*-------------------------------------------------------------------------
+ *
+ * partdesc.h
+ *
+ * Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ *
+ * src/include/partitioning/partdesc.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef PARTDESC_H
+#define PARTDESC_H
+
+#include "partitioning/partdefs.h"
+#include "utils/relcache.h"
+
+/*
+ * Information about partitions of a partitioned table.
+ */
+typedef struct PartitionDescData
+{
+ int nparts; /* Number of partitions */
+ Oid *oids; /* Array of 'nparts' elements containing
+ * partition OIDs in order of their bounds */
+ bool *is_leaf; /* Array of 'nparts' elements storing whether
+ * the corresponding 'oids' element belongs to
+ * a leaf partition or not */
+ PartitionBoundInfo boundinfo; /* collection of partition bounds */
+} PartitionDescData;
+
+extern void RelationBuildPartitionDesc(Relation rel);
+
+extern Oid get_default_oid_from_partdesc(PartitionDesc partdesc);
+
+extern bool equalPartitionDescs(PartitionKey key, PartitionDesc partdesc1,
+ PartitionDesc partdesc2);
+
+#endif /* PARTDESC_H */
diff --git a/src/include/utils/partcache.h b/src/include/utils/partcache.h
index 7c2f973f68..823ad2eeb6 100644
--- a/src/include/utils/partcache.h
+++ b/src/include/utils/partcache.h
@@ -47,7 +47,6 @@ typedef struct PartitionKeyData
} PartitionKeyData;
extern void RelationBuildPartitionKey(Relation relation);
-extern void RelationBuildPartitionDesc(Relation rel);
extern List *RelationGetPartitionQual(Relation rel);
extern Expr *get_partition_qual_relid(Oid relid);
--
2.17.2 (Apple Git-113)
On Tue, Jan 29, 2019 at 1:59 PM Robert Haas <robertmhaas@gmail.com> wrote:
I don't know how to reduce this to something reliable enough to
include it in the regression tests, and maybe we don't really need
that, but it's good to know that this is not a purely theoretical
problem. I think next I'll try to write some code to make
execPartition.c able to cope with the situation when it arises.
OK, that seems to be pretty easy. New patch series attached. The
patch with that new logic is 0004. I've consolidated some of the
things I had as separate patches in my last post and rewritten the
commit messages to explain more clearly the purpose of each patch.
Open issues:
- For now, I haven't tried to handle the DETACH PARTITION case. I
don't think there's anything preventing someone - possibly even me -
from implementing the counter-based approach that I described in the
previous message, but I think it would be good to have some more
discussion first on whether it's acceptable to make concurrent queries
error out. I think any queries that were already up and running would
be fine, but any that were planned before the DETACH and tried to
execute afterwards would get an ERROR. That's fairly low-probability,
because normally the shared invalidation machinery would cause
replanning, but there's a race condition, so we'd have to document
something like: if you use this feature, it'll probably just work, but
you might get some funny errors in other sessions if you're unlucky.
That kinda sucks but maybe we should just suck it up. Possibly we
should consider making the concurrent behavior optional, so that if
you'd rather take blocking locks than risk errors, you have that
option. Of course I guess you could also just let people do an
explicit LOCK TABLE if that's what they want. Or we could try to
actually make it work in that case, I guess by ignoring the detached
partitions, but that seems a lot harder.
- 0003 doesn't have any handling for parallel query at this point, so
even though within a single backend a single query execution will
always get the same PartitionDesc for the same relation, the answers
might not be consistent across the parallel group. I keep going back
and forth on whether this really matters. It's too late to modify the
plan, so any relations attached since it was generated are not going
to get scanned. As for detached relations, we're talking about making
them error out, so we don't have to worry about different backends
coming to different conclusions about whether they should be scanned.
But maybe we should try to be smarter instead. One concern is that
even relations that aren't scanned could still be affected because of
tuple routing, but right now parallel queries can't INSERT or UPDATE
rows anyway. Then again, maybe we should try not to add more
obstacles in the way of lifting that restriction. Then again again,
AFAICT we wouldn't be able to test that the new code is actually
solving a problem right now today, and how much untested code do we
really want in the tree? And then on the eleventh hand, maybe there
are other reasons why it's important to use the same PartitionDesc
across all parallel workers that I'm not thinking about at the moment.
- 0003 also changes the order in which locks are acquired. I am not
sure whether we care about this, especially in view of other pending
changes.
If you know of other problems, have solutions to or opinions about
these, or think the whole approach is wrong, please speak up!
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
0002-Ensure-that-RelationBuildPartitionDesc-sees-a-consis.patch
From d08abdedafc1bc7ee5bad85acc2adb83a391d330 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Nov 2018 12:15:44 -0500
Subject: [PATCH 2/5] Ensure that RelationBuildPartitionDesc sees a consistent
view.
If partitions are added or removed concurrently, make sure that we
nevertheless get a view of the partition list and the partition
descriptor for each partition which is consistent with the system
state at some single point in the commit history.
To do this, reuse an idea first invented by Noah Misch back in
commit 4240e429d0c2d889d0cda23c618f94e12c13ade7.
Nothing in this commit permits partitions to be added or removed
concurrently; it just allows RelationBuildPartitionDesc to produce
reasonable results if they do. It also does not guarantee that
the results produced by RelationBuildPartitionDesc will be stable
from one call to the next; it only tries to make sure that they
will be sane.
---
src/backend/partitioning/partdesc.c | 137 ++++++++++++++++++++--------
1 file changed, 101 insertions(+), 36 deletions(-)
diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c
index 8a4b63aa26..66b1e38527 100644
--- a/src/backend/partitioning/partdesc.c
+++ b/src/backend/partitioning/partdesc.c
@@ -18,7 +18,9 @@
#include "catalog/pg_inherits.h"
#include "partitioning/partbounds.h"
#include "partitioning/partdesc.h"
+#include "storage/sinval.h"
#include "utils/builtins.h"
+#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -47,45 +49,113 @@ RelationBuildPartitionDesc(Relation rel)
MemoryContext oldcxt;
int *mapping;
- /* Get partition oids from pg_inherits */
- inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
- nparts = list_length(inhoids);
-
- if (nparts > 0)
+ /*
+ * Fetch catalog information. Since we want to allow partitions to be
+ * added and removed without holding AccessExclusiveLock on the parent
+ * table, it's possible that the catalog contents could be changing under
+ * us. That means that by the time we fetch the partition bound for a
+ * partition returned by find_inheritance_children, it might no longer be
+ * a partition or might even be a partition of some other table.
+ *
+ * To ensure that we get a consistent view of the catalog data, we first
+ * fetch everything we need and then call AcceptInvalidationMessages. If
+ * SharedInvalidMessageCounter advances between the time we start fetching
+ * information and the time AcceptInvalidationMessages() completes, that
+ * means something may have changed under us, so we start over and do it
+ * all again.
+ */
+ for (;;)
{
- oids = palloc(nparts * sizeof(Oid));
- boundspecs = palloc(nparts * sizeof(PartitionBoundSpec *));
+ uint64 inval_count = SharedInvalidMessageCounter;
+
+ /* Get partition oids from pg_inherits */
+ inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
+ nparts = list_length(inhoids);
+
+ if (nparts > 0)
+ {
+ oids = palloc(nparts * sizeof(Oid));
+ boundspecs = palloc(nparts * sizeof(PartitionBoundSpec *));
+ }
+
+ /* Collect bound spec nodes for each partition */
+ i = 0;
+ foreach(cell, inhoids)
+ {
+ Oid inhrelid = lfirst_oid(cell);
+ HeapTuple tuple;
+ PartitionBoundSpec *boundspec = NULL;
+
+ /*
+ * Don't put any sanity checks here that might fail as a result of
+ * concurrent DDL, such as a check that relpartbound is not NULL.
+ * We could transiently see such states as a result of concurrent
+ * DDL. Such checks can be performed only after we're sure we got
+ * a consistent view of the underlying data.
+ */
+ tuple = SearchSysCache1(RELOID, inhrelid);
+ if (HeapTupleIsValid(tuple))
+ {
+ Datum datum;
+ bool isnull;
+
+ datum = SysCacheGetAttr(RELOID, tuple,
+ Anum_pg_class_relpartbound,
+ &isnull);
+ if (!isnull)
+ boundspec = stringToNode(TextDatumGetCString(datum));
+ ReleaseSysCache(tuple);
+ }
+
+ oids[i] = inhrelid;
+ boundspecs[i] = boundspec;
+ ++i;
+ }
+
+ /*
+ * If no relevant catalog changes have occurred (see comments at the
+ * top of this loop), then we got a consistent view of our partition
+ * list and can stop now.
+ */
+ AcceptInvalidationMessages();
+ if (inval_count == SharedInvalidMessageCounter)
+ break;
+
+ /* Something changed, so retry from the top. */
+ if (oids != NULL)
+ {
+ pfree(oids);
+ oids = NULL;
+ }
+ if (boundspecs != NULL)
+ {
+ pfree(boundspecs);
+ boundspecs = NULL;
+ }
+ if (inhoids != NIL)
+ list_free(inhoids);
}
- /* Collect bound spec nodes for each partition */
- i = 0;
- foreach(cell, inhoids)
+ /*
+ * At this point, we should have a consistent view of the data we got from
+ * pg_inherits and pg_class, so it's safe to perform some sanity checks.
+ */
+ for (i = 0; i < nparts; ++i)
{
- Oid inhrelid = lfirst_oid(cell);
- HeapTuple tuple;
- Datum datum;
- bool isnull;
- PartitionBoundSpec *boundspec;
-
- tuple = SearchSysCache1(RELOID, inhrelid);
- if (!HeapTupleIsValid(tuple))
- elog(ERROR, "cache lookup failed for relation %u", inhrelid);
-
- datum = SysCacheGetAttr(RELOID, tuple,
- Anum_pg_class_relpartbound,
- &isnull);
- if (isnull)
- elog(ERROR, "null relpartbound for relation %u", inhrelid);
- boundspec = stringToNode(TextDatumGetCString(datum));
- if (!IsA(boundspec, PartitionBoundSpec))
+ Oid inhrelid = oids[i];
+ PartitionBoundSpec *spec = boundspecs[i];
+
+ if (!spec)
+ elog(ERROR, "missing relpartbound for relation %u", inhrelid);
+ if (!IsA(spec, PartitionBoundSpec))
elog(ERROR, "invalid relpartbound for relation %u", inhrelid);
/*
- * Sanity check: If the PartitionBoundSpec says this is the default
- * partition, its OID should correspond to whatever's stored in
- * pg_partitioned_table.partdefid; if not, the catalog is corrupt.
+ * If the PartitionBoundSpec says this is the default partition, its
+ * OID should match pg_partitioned_table.partdefid; if not, the
+ * catalog is corrupt.
*/
- if (boundspec->is_default)
+ if (spec->is_default)
{
Oid partdefid;
@@ -94,11 +164,6 @@ RelationBuildPartitionDesc(Relation rel)
elog(ERROR, "expected partdefid %u, but got %u",
inhrelid, partdefid);
}
-
- oids[i] = inhrelid;
- boundspecs[i] = boundspec;
- ++i;
- ReleaseSysCache(tuple);
}
/* Now build the actual relcache partition descriptor */
--
2.17.2 (Apple Git-113)
0001-Move-code-for-managing-PartitionDescs-into-a-new-fil.patch
From 4066b913cc6c46ae6ff1eb941eae1cbc8fb7540b Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 26 Nov 2018 14:31:53 -0500
Subject: [PATCH 1/5] Move code for managing PartitionDescs into a new file,
partdesc.c
---
src/backend/catalog/heap.c | 1 +
src/backend/catalog/partition.c | 16 --
src/backend/catalog/pg_constraint.c | 2 +-
src/backend/commands/indexcmds.c | 2 +-
src/backend/commands/tablecmds.c | 1 +
src/backend/commands/trigger.c | 1 +
src/backend/executor/execPartition.c | 1 +
src/backend/optimizer/util/inherit.c | 1 +
src/backend/optimizer/util/plancat.c | 2 +-
src/backend/partitioning/Makefile | 2 +-
src/backend/partitioning/partbounds.c | 6 +-
src/backend/partitioning/partdesc.c | 221 ++++++++++++++++++++++++++
src/backend/utils/cache/partcache.c | 124 ---------------
src/backend/utils/cache/relcache.c | 57 +------
src/include/catalog/partition.h | 15 --
src/include/partitioning/partdesc.h | 39 +++++
src/include/utils/partcache.h | 1 -
17 files changed, 274 insertions(+), 218 deletions(-)
create mode 100644 src/backend/partitioning/partdesc.c
create mode 100644 src/include/partitioning/partdesc.h
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 06d18a1cfb..6305cbbf06 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -69,6 +69,7 @@
#include "parser/parse_collate.h"
#include "parser/parse_expr.h"
#include "parser/parse_relation.h"
+#include "partitioning/partdesc.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "storage/smgr.h"
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 0d3bc3a2c7..3ccdaff8c4 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -253,22 +253,6 @@ has_partition_attrs(Relation rel, Bitmapset *attnums, bool *used_in_expr)
return false;
}
-/*
- * get_default_oid_from_partdesc
- *
- * Given a partition descriptor, return the OID of the default partition, if
- * one exists; else, return InvalidOid.
- */
-Oid
-get_default_oid_from_partdesc(PartitionDesc partdesc)
-{
- if (partdesc && partdesc->boundinfo &&
- partition_bound_has_default(partdesc->boundinfo))
- return partdesc->oids[partdesc->boundinfo->default_index];
-
- return InvalidOid;
-}
-
/*
* get_default_partition_oid
*
diff --git a/src/backend/catalog/pg_constraint.c b/src/backend/catalog/pg_constraint.c
index 698b493fc4..ea8817e8c8 100644
--- a/src/backend/catalog/pg_constraint.c
+++ b/src/backend/catalog/pg_constraint.c
@@ -24,12 +24,12 @@
#include "catalog/dependency.h"
#include "catalog/indexing.h"
#include "catalog/objectaccess.h"
-#include "catalog/partition.h"
#include "catalog/pg_constraint.h"
#include "catalog/pg_operator.h"
#include "catalog/pg_type.h"
#include "commands/defrem.h"
#include "commands/tablecmds.h"
+#include "partitioning/partdesc.h"
#include "utils/array.h"
#include "utils/builtins.h"
#include "utils/fmgroids.h"
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index bd85099c28..fc176d39ab 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -24,7 +24,6 @@
#include "catalog/catalog.h"
#include "catalog/index.h"
#include "catalog/indexing.h"
-#include "catalog/partition.h"
#include "catalog/pg_am.h"
#include "catalog/pg_constraint.h"
#include "catalog/pg_inherits.h"
@@ -46,6 +45,7 @@
#include "parser/parse_coerce.h"
#include "parser/parse_func.h"
#include "parser/parse_oper.h"
+#include "partitioning/partdesc.h"
#include "rewrite/rewriteManip.h"
#include "storage/lmgr.h"
#include "storage/proc.h"
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 35a9ade059..5646e6e075 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -74,6 +74,7 @@
#include "parser/parse_utilcmd.h"
#include "parser/parser.h"
#include "partitioning/partbounds.h"
+#include "partitioning/partdesc.h"
#include "pgstat.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rewriteHandler.h"
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 7b5896b98f..f5c911b5a7 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -42,6 +42,7 @@
#include "parser/parse_func.h"
#include "parser/parse_relation.h"
#include "parser/parsetree.h"
+#include "partitioning/partdesc.h"
#include "pgstat.h"
#include "rewrite/rewriteManip.h"
#include "storage/bufmgr.h"
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 2a7bc01563..58666fcf26 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -24,6 +24,7 @@
#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "partitioning/partbounds.h"
+#include "partitioning/partdesc.h"
#include "partitioning/partprune.h"
#include "rewrite/rewriteManip.h"
#include "utils/lsyscache.h"
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index eaf788e578..faba493200 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -23,6 +23,7 @@
#include "optimizer/inherit.h"
#include "optimizer/planner.h"
#include "optimizer/prep.h"
+#include "partitioning/partdesc.h"
#include "utils/rel.h"
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 3efa1bdc1a..eec1e09e35 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -27,7 +27,6 @@
#include "catalog/catalog.h"
#include "catalog/dependency.h"
#include "catalog/heap.h"
-#include "catalog/partition.h"
#include "catalog/pg_am.h"
#include "catalog/pg_statistic_ext.h"
#include "foreign/fdwapi.h"
@@ -39,6 +38,7 @@
#include "optimizer/plancat.h"
#include "optimizer/prep.h"
#include "partitioning/partbounds.h"
+#include "partitioning/partdesc.h"
#include "parser/parse_relation.h"
#include "parser/parsetree.h"
#include "rewrite/rewriteManip.h"
diff --git a/src/backend/partitioning/Makefile b/src/backend/partitioning/Makefile
index 278fac3afa..82093c615f 100644
--- a/src/backend/partitioning/Makefile
+++ b/src/backend/partitioning/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/partitioning
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = partprune.o partbounds.o
+OBJS = partbounds.o partdesc.o partprune.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index d478ae7e19..e71eb3793b 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -10,7 +10,8 @@
* src/backend/partitioning/partbounds.c
*
*-------------------------------------------------------------------------
-*/
+ */
+
#include "postgres.h"
#include "access/heapam.h"
@@ -23,8 +24,9 @@
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
#include "parser/parse_coerce.h"
-#include "partitioning/partprune.h"
#include "partitioning/partbounds.h"
+#include "partitioning/partdesc.h"
+#include "partitioning/partprune.h"
#include "utils/builtins.h"
#include "utils/datum.h"
#include "utils/fmgroids.h"
diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c
new file mode 100644
index 0000000000..8a4b63aa26
--- /dev/null
+++ b/src/backend/partitioning/partdesc.c
@@ -0,0 +1,221 @@
+/*-------------------------------------------------------------------------
+ *
+ * partdesc.c
+ * Support routines for manipulating partition descriptors
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/partitioning/partdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "catalog/partition.h"
+#include "catalog/pg_inherits.h"
+#include "partitioning/partbounds.h"
+#include "partitioning/partdesc.h"
+#include "utils/builtins.h"
+#include "utils/lsyscache.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+#include "utils/partcache.h"
+#include "utils/syscache.h"
+
+/*
+ * RelationBuildPartitionDesc
+ * Form rel's partition descriptor
+ *
+ * Not flushed from the cache by RelationClearRelation() unless changed because
+ * of addition or removal of a partition.
+ */
+void
+RelationBuildPartitionDesc(Relation rel)
+{
+ PartitionDesc partdesc;
+ PartitionBoundInfo boundinfo = NULL;
+ List *inhoids;
+ PartitionBoundSpec **boundspecs = NULL;
+ Oid *oids = NULL;
+ ListCell *cell;
+ int i,
+ nparts;
+ PartitionKey key = RelationGetPartitionKey(rel);
+ MemoryContext oldcxt;
+ int *mapping;
+
+ /* Get partition oids from pg_inherits */
+ inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
+ nparts = list_length(inhoids);
+
+ if (nparts > 0)
+ {
+ oids = palloc(nparts * sizeof(Oid));
+ boundspecs = palloc(nparts * sizeof(PartitionBoundSpec *));
+ }
+
+ /* Collect bound spec nodes for each partition */
+ i = 0;
+ foreach(cell, inhoids)
+ {
+ Oid inhrelid = lfirst_oid(cell);
+ HeapTuple tuple;
+ Datum datum;
+ bool isnull;
+ PartitionBoundSpec *boundspec;
+
+ tuple = SearchSysCache1(RELOID, inhrelid);
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", inhrelid);
+
+ datum = SysCacheGetAttr(RELOID, tuple,
+ Anum_pg_class_relpartbound,
+ &isnull);
+ if (isnull)
+ elog(ERROR, "null relpartbound for relation %u", inhrelid);
+ boundspec = stringToNode(TextDatumGetCString(datum));
+ if (!IsA(boundspec, PartitionBoundSpec))
+ elog(ERROR, "invalid relpartbound for relation %u", inhrelid);
+
+ /*
+ * Sanity check: If the PartitionBoundSpec says this is the default
+ * partition, its OID should correspond to whatever's stored in
+ * pg_partitioned_table.partdefid; if not, the catalog is corrupt.
+ */
+ if (boundspec->is_default)
+ {
+ Oid partdefid;
+
+ partdefid = get_default_partition_oid(RelationGetRelid(rel));
+ if (partdefid != inhrelid)
+ elog(ERROR, "expected partdefid %u, but got %u",
+ inhrelid, partdefid);
+ }
+
+ oids[i] = inhrelid;
+ boundspecs[i] = boundspec;
+ ++i;
+ ReleaseSysCache(tuple);
+ }
+
+ /* Now build the actual relcache partition descriptor */
+ rel->rd_pdcxt = AllocSetContextCreate(CacheMemoryContext,
+ "partition descriptor",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextCopyAndSetIdentifier(rel->rd_pdcxt,
+ RelationGetRelationName(rel));
+
+ oldcxt = MemoryContextSwitchTo(rel->rd_pdcxt);
+ partdesc = (PartitionDescData *) palloc0(sizeof(PartitionDescData));
+ partdesc->nparts = nparts;
+ /* oids and boundinfo are allocated below. */
+
+ MemoryContextSwitchTo(oldcxt);
+
+ if (nparts == 0)
+ {
+ rel->rd_partdesc = partdesc;
+ return;
+ }
+
+ /* First create PartitionBoundInfo */
+ boundinfo = partition_bounds_create(boundspecs, nparts, key, &mapping);
+
+ /* Now copy boundinfo and oids into partdesc. */
+ oldcxt = MemoryContextSwitchTo(rel->rd_pdcxt);
+ partdesc->boundinfo = partition_bounds_copy(boundinfo, key);
+ partdesc->oids = (Oid *) palloc(partdesc->nparts * sizeof(Oid));
+ partdesc->is_leaf = (bool *) palloc(partdesc->nparts * sizeof(bool));
+
+ /*
+ * Now assign OIDs from the original array into mapped indexes of the
+ * result array. The order of OIDs in the former is defined by the
+ * catalog scan that retrieved them, whereas that in the latter is defined
+ * by canonicalized representation of the partition bounds.
+ */
+ for (i = 0; i < partdesc->nparts; i++)
+ {
+ int index = mapping[i];
+
+ partdesc->oids[index] = oids[i];
+ /* Record if the partition is a leaf partition */
+ partdesc->is_leaf[index] =
+ (get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
+ }
+ MemoryContextSwitchTo(oldcxt);
+
+ rel->rd_partdesc = partdesc;
+}
+
+/*
+ * equalPartitionDescs
+ * Compare two partition descriptors for logical equality
+ */
+bool
+equalPartitionDescs(PartitionKey key, PartitionDesc partdesc1,
+ PartitionDesc partdesc2)
+{
+ int i;
+
+ if (partdesc1 != NULL)
+ {
+ if (partdesc2 == NULL)
+ return false;
+ if (partdesc1->nparts != partdesc2->nparts)
+ return false;
+
+ Assert(key != NULL || partdesc1->nparts == 0);
+
+ /*
+ * Same oids? If the partitioning structure did not change, that is,
+ * no partitions were added or removed to the relation, the oids array
+ * should still match element-by-element.
+ */
+ for (i = 0; i < partdesc1->nparts; i++)
+ {
+ if (partdesc1->oids[i] != partdesc2->oids[i])
+ return false;
+ }
+
+ /*
+ * Now compare partition bound collections. The logic to iterate over
+ * the collections is private to partition.c.
+ */
+ if (partdesc1->boundinfo != NULL)
+ {
+ if (partdesc2->boundinfo == NULL)
+ return false;
+
+ if (!partition_bounds_equal(key->partnatts, key->parttyplen,
+ key->parttypbyval,
+ partdesc1->boundinfo,
+ partdesc2->boundinfo))
+ return false;
+ }
+ else if (partdesc2->boundinfo != NULL)
+ return false;
+ }
+ else if (partdesc2 != NULL)
+ return false;
+
+ return true;
+}
+
+/*
+ * get_default_oid_from_partdesc
+ *
+ * Given a partition descriptor, return the OID of the default partition, if
+ * one exists; else, return InvalidOid.
+ */
+Oid
+get_default_oid_from_partdesc(PartitionDesc partdesc)
+{
+ if (partdesc && partdesc->boundinfo &&
+ partition_bound_has_default(partdesc->boundinfo))
+ return partdesc->oids[partdesc->boundinfo->default_index];
+
+ return InvalidOid;
+}
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 1b50f283c5..2b55f25e75 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -243,130 +243,6 @@ RelationBuildPartitionKey(Relation relation)
relation->rd_partkey = key;
}
-/*
- * RelationBuildPartitionDesc
- * Form rel's partition descriptor
- *
- * Not flushed from the cache by RelationClearRelation() unless changed because
- * of addition or removal of partition.
- */
-void
-RelationBuildPartitionDesc(Relation rel)
-{
- PartitionDesc partdesc;
- PartitionBoundInfo boundinfo = NULL;
- List *inhoids;
- PartitionBoundSpec **boundspecs = NULL;
- Oid *oids = NULL;
- ListCell *cell;
- int i,
- nparts;
- PartitionKey key = RelationGetPartitionKey(rel);
- MemoryContext oldcxt;
- int *mapping;
-
- /* Get partition oids from pg_inherits */
- inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
- nparts = list_length(inhoids);
-
- if (nparts > 0)
- {
- oids = palloc(nparts * sizeof(Oid));
- boundspecs = palloc(nparts * sizeof(PartitionBoundSpec *));
- }
-
- /* Collect bound spec nodes for each partition */
- i = 0;
- foreach(cell, inhoids)
- {
- Oid inhrelid = lfirst_oid(cell);
- HeapTuple tuple;
- Datum datum;
- bool isnull;
- PartitionBoundSpec *boundspec;
-
- tuple = SearchSysCache1(RELOID, inhrelid);
- if (!HeapTupleIsValid(tuple))
- elog(ERROR, "cache lookup failed for relation %u", inhrelid);
-
- datum = SysCacheGetAttr(RELOID, tuple,
- Anum_pg_class_relpartbound,
- &isnull);
- if (isnull)
- elog(ERROR, "null relpartbound for relation %u", inhrelid);
- boundspec = stringToNode(TextDatumGetCString(datum));
- if (!IsA(boundspec, PartitionBoundSpec))
- elog(ERROR, "invalid relpartbound for relation %u", inhrelid);
-
- /*
- * Sanity check: If the PartitionBoundSpec says this is the default
- * partition, its OID should correspond to whatever's stored in
- * pg_partitioned_table.partdefid; if not, the catalog is corrupt.
- */
- if (boundspec->is_default)
- {
- Oid partdefid;
-
- partdefid = get_default_partition_oid(RelationGetRelid(rel));
- if (partdefid != inhrelid)
- elog(ERROR, "expected partdefid %u, but got %u",
- inhrelid, partdefid);
- }
-
- oids[i] = inhrelid;
- boundspecs[i] = boundspec;
- ++i;
- ReleaseSysCache(tuple);
- }
-
- /* Now build the actual relcache partition descriptor */
- rel->rd_pdcxt = AllocSetContextCreate(CacheMemoryContext,
- "partition descriptor",
- ALLOCSET_DEFAULT_SIZES);
- MemoryContextCopyAndSetIdentifier(rel->rd_pdcxt, RelationGetRelationName(rel));
-
- oldcxt = MemoryContextSwitchTo(rel->rd_pdcxt);
- partdesc = (PartitionDescData *) palloc0(sizeof(PartitionDescData));
- partdesc->nparts = nparts;
- /* oids and boundinfo are allocated below. */
-
- MemoryContextSwitchTo(oldcxt);
-
- if (nparts == 0)
- {
- rel->rd_partdesc = partdesc;
- return;
- }
-
- /* First create PartitionBoundInfo */
- boundinfo = partition_bounds_create(boundspecs, nparts, key, &mapping);
-
- /* Now copy boundinfo and oids into partdesc. */
- oldcxt = MemoryContextSwitchTo(rel->rd_pdcxt);
- partdesc->boundinfo = partition_bounds_copy(boundinfo, key);
- partdesc->oids = (Oid *) palloc(partdesc->nparts * sizeof(Oid));
- partdesc->is_leaf = (bool *) palloc(partdesc->nparts * sizeof(bool));
-
- /*
- * Now assign OIDs from the original array into mapped indexes of the
- * result array. The order of OIDs in the former is defined by the
- * catalog scan that retrieved them, whereas that in the latter is defined
- * by canonicalized representation of the partition bounds.
- */
- for (i = 0; i < partdesc->nparts; i++)
- {
- int index = mapping[i];
-
- partdesc->oids[index] = oids[i];
- /* Record if the partition is a leaf partition */
- partdesc->is_leaf[index] =
- (get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
- }
- MemoryContextSwitchTo(oldcxt);
-
- rel->rd_partdesc = partdesc;
-}
-
/*
* RelationGetPartitionQual
*
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 721c9dab95..54a40ef00b 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -72,6 +72,7 @@
#include "nodes/nodeFuncs.h"
#include "optimizer/optimizer.h"
#include "partitioning/partbounds.h"
+#include "partitioning/partdesc.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rowsecurity.h"
#include "storage/lmgr.h"
@@ -283,8 +284,6 @@ static OpClassCacheEnt *LookupOpclassInfo(Oid operatorClassOid,
StrategyNumber numSupport);
static void RelationCacheInitFileRemoveInDir(const char *tblspcpath);
static void unlink_initfile(const char *initfilename, int elevel);
-static bool equalPartitionDescs(PartitionKey key, PartitionDesc partdesc1,
- PartitionDesc partdesc2);
/*
@@ -995,60 +994,6 @@ equalRSDesc(RowSecurityDesc *rsdesc1, RowSecurityDesc *rsdesc2)
return true;
}
-/*
- * equalPartitionDescs
- * Compare two partition descriptors for logical equality
- */
-static bool
-equalPartitionDescs(PartitionKey key, PartitionDesc partdesc1,
- PartitionDesc partdesc2)
-{
- int i;
-
- if (partdesc1 != NULL)
- {
- if (partdesc2 == NULL)
- return false;
- if (partdesc1->nparts != partdesc2->nparts)
- return false;
-
- Assert(key != NULL || partdesc1->nparts == 0);
-
- /*
- * Same oids? If the partitioning structure did not change, that is,
- * no partitions were added or removed to the relation, the oids array
- * should still match element-by-element.
- */
- for (i = 0; i < partdesc1->nparts; i++)
- {
- if (partdesc1->oids[i] != partdesc2->oids[i])
- return false;
- }
-
- /*
- * Now compare partition bound collections. The logic to iterate over
- * the collections is private to partition.c.
- */
- if (partdesc1->boundinfo != NULL)
- {
- if (partdesc2->boundinfo == NULL)
- return false;
-
- if (!partition_bounds_equal(key->partnatts, key->parttyplen,
- key->parttypbyval,
- partdesc1->boundinfo,
- partdesc2->boundinfo))
- return false;
- }
- else if (partdesc2->boundinfo != NULL)
- return false;
- }
- else if (partdesc2 != NULL)
- return false;
-
- return true;
-}
-
/*
* RelationBuildDesc
*
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 5685d2fd57..d84e325983 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -19,20 +19,6 @@
/* Seed for the extended hash function */
#define HASH_PARTITION_SEED UINT64CONST(0x7A5B22367996DCFD)
-/*
- * Information about partitions of a partitioned table.
- */
-typedef struct PartitionDescData
-{
- int nparts; /* Number of partitions */
- Oid *oids; /* Array of 'nparts' elements containing
- * partition OIDs in order of the their bounds */
- bool *is_leaf; /* Array of 'nparts' elements storing whether
- * the corresponding 'oids' element belongs to
- * a leaf partition or not */
- PartitionBoundInfo boundinfo; /* collection of partition bounds */
-} PartitionDescData;
-
extern Oid get_partition_parent(Oid relid);
extern List *get_partition_ancestors(Oid relid);
extern List *map_partition_varattnos(List *expr, int fromrel_varno,
@@ -41,7 +27,6 @@ extern List *map_partition_varattnos(List *expr, int fromrel_varno,
extern bool has_partition_attrs(Relation rel, Bitmapset *attnums,
bool *used_in_expr);
-extern Oid get_default_oid_from_partdesc(PartitionDesc partdesc);
extern Oid get_default_partition_oid(Oid parentId);
extern void update_default_partition_oid(Oid parentId, Oid defaultPartId);
extern List *get_proposed_default_constraint(List *new_part_constaints);
diff --git a/src/include/partitioning/partdesc.h b/src/include/partitioning/partdesc.h
new file mode 100644
index 0000000000..f72b70dded
--- /dev/null
+++ b/src/include/partitioning/partdesc.h
@@ -0,0 +1,39 @@
+/*-------------------------------------------------------------------------
+ *
+ * partdesc.h
+ *
+ * Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ *
+ * src/include/partitioning/partdesc.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef PARTDESC_H
+#define PARTDESC_H
+
+#include "partitioning/partdefs.h"
+#include "utils/relcache.h"
+
+/*
+ * Information about partitions of a partitioned table.
+ */
+typedef struct PartitionDescData
+{
+ int nparts; /* Number of partitions */
+ Oid *oids; /* Array of 'nparts' elements containing
+ * partition OIDs in order of their bounds */
+ bool *is_leaf; /* Array of 'nparts' elements storing whether
+ * the corresponding 'oids' element belongs to
+ * a leaf partition or not */
+ PartitionBoundInfo boundinfo; /* collection of partition bounds */
+} PartitionDescData;
+
+extern void RelationBuildPartitionDesc(Relation rel);
+
+extern Oid get_default_oid_from_partdesc(PartitionDesc partdesc);
+
+extern bool equalPartitionDescs(PartitionKey key, PartitionDesc partdesc1,
+ PartitionDesc partdesc2);
+
+#endif							/* PARTDESC_H */
diff --git a/src/include/utils/partcache.h b/src/include/utils/partcache.h
index 7c2f973f68..823ad2eeb6 100644
--- a/src/include/utils/partcache.h
+++ b/src/include/utils/partcache.h
@@ -47,7 +47,6 @@ typedef struct PartitionKeyData
} PartitionKeyData;
extern void RelationBuildPartitionKey(Relation relation);
-extern void RelationBuildPartitionDesc(Relation rel);
extern List *RelationGetPartitionQual(Relation rel);
extern Expr *get_partition_qual_relid(Oid relid);
--
2.17.2 (Apple Git-113)
Attachment: 0005-Reduce-the-lock-level-required-to-attach-a-partition.patch
From 8b2c163a62cc77de4e5c94c1de5e7e8df618abc4 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 17 Jan 2019 10:25:12 -0500
Subject: [PATCH 5/5] Reduce the lock level required to attach a partition.
Previous work makes this safe (hopefully).
---
src/backend/commands/tablecmds.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 5646e6e075..0ca85f5b19 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -3652,6 +3652,9 @@ AlterTableGetLockLevel(List *cmds)
break;
case AT_AttachPartition:
+ cmd_lockmode = ShareUpdateExclusiveLock;
+ break;
+
case AT_DetachPartition:
cmd_lockmode = AccessExclusiveLock;
break;
--
2.17.2 (Apple Git-113)
Attachment: 0004-Teach-runtime-partition-pruning-to-cope-with-concurr.patch
From c5b36d28d33641001ab33f340535d0242cb31336 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 17 Jan 2019 09:11:10 -0500
Subject: [PATCH 4/5] Teach runtime partition pruning to cope with concurrent
partition adds.
If new partitions were added between plan time and execution time, the
indexes stored in the subplan_map[] and subpart_map[] arrays within
the plan's PartitionedRelPruneInfo would no longer be correct. Adjust
the code to cope with added partitions. There does not seem to be
a simple way to cope with partitions that are removed, mostly because
they could then get added back again with different bounds, so don't
try to cope with that situation.
---
src/backend/executor/execPartition.c | 68 +++++++++++++++++++++++-----
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/partitioning/partprune.c | 7 ++-
src/include/nodes/plannodes.h | 1 +
6 files changed, 66 insertions(+), 13 deletions(-)
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 9124d5e54d..d1cc2e944c 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1602,18 +1602,6 @@ ExecCreatePartitionPruneState(PlanState *planstate,
int n_steps;
ListCell *lc3;
- /*
- * We must copy the subplan_map rather than pointing directly to
- * the plan's version, as we may end up making modifications to it
- * later.
- */
- pprune->subplan_map = palloc(sizeof(int) * pinfo->nparts);
- memcpy(pprune->subplan_map, pinfo->subplan_map,
- sizeof(int) * pinfo->nparts);
-
- /* We can use the subpart_map verbatim, since we never modify it */
- pprune->subpart_map = pinfo->subpart_map;
-
/* present_parts is also subject to later modification */
pprune->present_parts = bms_copy(pinfo->present_parts);
@@ -1628,6 +1616,62 @@ ExecCreatePartitionPruneState(PlanState *planstate,
partdesc = PartitionDirectoryLookup(estate->es_partition_directory,
partrel);
+ /*
+ * Initialize the subplan_map and subpart_map. Since detaching a
+ * partition requires AccessExclusiveLock, no partitions can have
+ * disappeared, nor can the bounds for any partition have changed.
+ * However, new partitions may have been added.
+ */
+ Assert(partdesc->nparts >= pinfo->nparts);
+ pprune->subplan_map = palloc(sizeof(int) * partdesc->nparts);
+ if (partdesc->nparts == pinfo->nparts)
+ {
+ /*
+ * There are no new partitions, so this is simple. We can
+ * simply point to the subpart_map from the plan, but we must
+ * copy the subplan_map since we may change it later.
+ */
+ pprune->subpart_map = pinfo->subpart_map;
+ memcpy(pprune->subplan_map, pinfo->subplan_map,
+ sizeof(int) * pinfo->nparts);
+
+ /* Double-check that list of relations has not changed. */
+ Assert(memcmp(partdesc->oids, pinfo->relid_map,
+ pinfo->nparts * sizeof(Oid)) == 0);
+ }
+ else
+ {
+ int pd_idx = 0;
+ int pp_idx;
+
+ /*
+ * Some new partitions have appeared since plan time, and
+ * those are reflected in our PartitionDesc but were not
+ * present in the one used to construct subplan_map and
+ * subpart_map. So we must construct new and longer arrays
+ * where the partitions that were originally present map to the
+ * same place, and any added indexes map to -1, as if the
+ * new partitions had been pruned.
+ */
+ pprune->subpart_map = palloc(sizeof(int) * partdesc->nparts);
+ for (pp_idx = 0; pp_idx < partdesc->nparts; ++pp_idx)
+ {
+ if (pinfo->relid_map[pd_idx] != partdesc->oids[pp_idx])
+ {
+ pprune->subplan_map[pp_idx] = -1;
+ pprune->subpart_map[pp_idx] = -1;
+ }
+ else
+ {
+ pprune->subplan_map[pp_idx] =
+ pinfo->subplan_map[pd_idx];
+ pprune->subpart_map[pp_idx] =
+ pinfo->subpart_map[pd_idx++];
+ }
+ }
+ Assert(pd_idx == pinfo->nparts);
+ }
+
n_steps = list_length(pinfo->pruning_steps);
context->strategy = partkey->strategy;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 807393dfaa..26e6f84253 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -1197,6 +1197,7 @@ _copyPartitionedRelPruneInfo(const PartitionedRelPruneInfo *from)
COPY_SCALAR_FIELD(nexprs);
COPY_POINTER_FIELD(subplan_map, from->nparts * sizeof(int));
COPY_POINTER_FIELD(subpart_map, from->nparts * sizeof(int));
+ COPY_POINTER_FIELD(relid_map, from->nparts * sizeof(Oid));
COPY_POINTER_FIELD(hasexecparam, from->nexprs * sizeof(bool));
COPY_SCALAR_FIELD(do_initial_prune);
COPY_SCALAR_FIELD(do_exec_prune);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 9d44e3e4c6..99a6b50069 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -947,6 +947,7 @@ _outPartitionedRelPruneInfo(StringInfo str, const PartitionedRelPruneInfo *node)
WRITE_INT_FIELD(nexprs);
WRITE_INT_ARRAY(subplan_map, node->nparts);
WRITE_INT_ARRAY(subpart_map, node->nparts);
+ WRITE_OID_ARRAY(relid_map, node->nparts);
WRITE_BOOL_ARRAY(hasexecparam, node->nexprs);
WRITE_BOOL_FIELD(do_initial_prune);
WRITE_BOOL_FIELD(do_exec_prune);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 43491e297b..4433438fb6 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2384,6 +2384,7 @@ _readPartitionedRelPruneInfo(void)
READ_INT_FIELD(nexprs);
READ_INT_ARRAY(subplan_map, local_node->nparts);
READ_INT_ARRAY(subpart_map, local_node->nparts);
+ READ_OID_ARRAY(relid_map, local_node->nparts);
READ_BOOL_ARRAY(hasexecparam, local_node->nexprs);
READ_BOOL_FIELD(do_initial_prune);
READ_BOOL_FIELD(do_exec_prune);
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index 8c9721935d..b5c0889935 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -47,8 +47,9 @@
#include "optimizer/appendinfo.h"
#include "optimizer/optimizer.h"
#include "optimizer/pathnode.h"
-#include "partitioning/partprune.h"
+#include "parser/parsetree.h"
#include "partitioning/partbounds.h"
+#include "partitioning/partprune.h"
#include "rewrite/rewriteManip.h"
#include "utils/lsyscache.h"
@@ -359,6 +360,7 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
int partnatts = subpart->part_scheme->partnatts;
int *subplan_map;
int *subpart_map;
+ Oid *relid_map;
List *partprunequal;
List *pruning_steps;
bool contradictory;
@@ -434,6 +436,7 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
*/
subplan_map = (int *) palloc(nparts * sizeof(int));
subpart_map = (int *) palloc(nparts * sizeof(int));
+ relid_map = (Oid *) palloc(nparts * sizeof(Oid));
present_parts = NULL;
for (i = 0; i < nparts; i++)
@@ -444,6 +447,7 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
subplan_map[i] = subplanidx;
subpart_map[i] = subpartidx;
+ relid_map[i] = planner_rt_fetch(partrel->relid, root)->relid;
if (subplanidx >= 0)
{
present_parts = bms_add_member(present_parts, i);
@@ -462,6 +466,7 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
pinfo->nparts = nparts;
pinfo->subplan_map = subplan_map;
pinfo->subpart_map = subpart_map;
+ pinfo->relid_map = relid_map;
/* Determine which pruning types should be enabled at this level */
doruntimeprune |= analyze_partkey_exprs(pinfo, pruning_steps,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 6d087c268f..d66a187a53 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1108,6 +1108,7 @@ typedef struct PartitionedRelPruneInfo
int nexprs; /* Length of hasexecparam[] */
int *subplan_map; /* subplan index by partition index, or -1 */
int *subpart_map; /* subpart index by partition index, or -1 */
+ Oid *relid_map; /* relation OID by partition index, or 0 */
bool *hasexecparam; /* true if corresponding pruning_step contains
* any PARAM_EXEC Params. */
bool do_initial_prune; /* true if pruning should be performed
--
2.17.2 (Apple Git-113)
Attachment: 0003-Ensure-that-repeated-PartitionDesc-lookups-return-th.patch
From fc601ad4178a4a59f25f21312b9ab70458dbfd0a Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 28 Nov 2018 10:15:55 -0500
Subject: [PATCH 3/5] Ensure that repeated PartitionDesc lookups return the
same answer.
The query planner will get confused if a lookup of the PartitionDesc
for a particular relation does not return a consistent answer for
the entire duration of query planning. Likewise, query execution
will get confused if the same relation seems to have a different
PartitionDesc at different times. Invent a new PartitionDirectory
concept and use it to ensure consistency.
Note that this only ensures consistency within a single query
planning cycle or a single query execution. It doesn't guarantee
that the answer can't change between planning and execution, nor
does it change the way a PartitionDesc is constructed in the first
place.
Since this allows pointers to old PartitionDesc entries to survive
even after a relcache rebuild, also postpone removing the old
PartitionDesc entry until we're certain no one is using it.
---
src/backend/commands/copy.c | 2 +-
src/backend/executor/execPartition.c | 28 ++++++++---
src/backend/executor/nodeModifyTable.c | 2 +-
src/backend/optimizer/util/inherit.c | 68 ++++++++++++++------------
src/backend/optimizer/util/plancat.c | 2 +-
src/backend/partitioning/partdesc.c | 64 +++++++++++++++++++++++-
src/backend/utils/cache/relcache.c | 20 ++++++++
src/include/executor/execPartition.h | 3 +-
src/include/nodes/execnodes.h | 2 +
src/include/nodes/pathnodes.h | 3 ++
src/include/partitioning/partdefs.h | 2 +
src/include/partitioning/partdesc.h | 3 ++
12 files changed, 155 insertions(+), 44 deletions(-)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index dbb06397e6..382c966a6e 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2559,7 +2559,7 @@ CopyFrom(CopyState cstate)
* CopyFrom tuple routing.
*/
if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
- proute = ExecSetupPartitionTupleRouting(NULL, cstate->rel);
+ proute = ExecSetupPartitionTupleRouting(estate, NULL, cstate->rel);
if (cstate->whereClause)
cstate->qualexpr = ExecInitQual(castNode(List, cstate->whereClause),
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 58666fcf26..9124d5e54d 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -167,7 +167,8 @@ static void ExecInitRoutingInfo(ModifyTableState *mtstate,
PartitionDispatch dispatch,
ResultRelInfo *partRelInfo,
int partidx);
-static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
+static PartitionDispatch ExecInitPartitionDispatchInfo(EState *estate,
+ PartitionTupleRouting *proute,
Oid partoid, PartitionDispatch parent_pd, int partidx);
static void FormPartitionKeyDatum(PartitionDispatch pd,
TupleTableSlot *slot,
@@ -204,7 +205,8 @@ static void find_matching_subplans_recurse(PartitionPruningData *prunedata,
* it should be estate->es_query_cxt.
*/
PartitionTupleRouting *
-ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
+ExecSetupPartitionTupleRouting(EState *estate, ModifyTableState *mtstate,
+ Relation rel)
{
PartitionTupleRouting *proute;
ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
@@ -229,7 +231,8 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
* parent as NULL as we don't need to care about any parent of the target
* partitioned table.
*/
- ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL, 0);
+ ExecInitPartitionDispatchInfo(estate, proute, RelationGetRelid(rel),
+ NULL, 0);
/*
* If performing an UPDATE with tuple routing, we can reuse partition
@@ -430,7 +433,8 @@ ExecFindPartition(ModifyTableState *mtstate,
* Create the new PartitionDispatch. We pass the current one
* in as the parent PartitionDispatch
*/
- subdispatch = ExecInitPartitionDispatchInfo(proute,
+ subdispatch = ExecInitPartitionDispatchInfo(mtstate->ps.state,
+ proute,
partdesc->oids[partidx],
dispatch, partidx);
Assert(dispatch->indexes[partidx] >= 0 &&
@@ -972,7 +976,8 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
* newly created PartitionDispatch later.
*/
static PartitionDispatch
-ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
+ExecInitPartitionDispatchInfo(EState *estate,
+ PartitionTupleRouting *proute, Oid partoid,
PartitionDispatch parent_pd, int partidx)
{
Relation rel;
@@ -981,13 +986,17 @@ ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
int dispatchidx;
MemoryContext oldcxt;
+ if (estate->es_partition_directory == NULL)
+ estate->es_partition_directory =
+ CreatePartitionDirectory(estate->es_query_cxt);
+
oldcxt = MemoryContextSwitchTo(proute->memcxt);
if (partoid != RelationGetRelid(proute->partition_root))
rel = table_open(partoid, NoLock);
else
rel = proute->partition_root;
- partdesc = RelationGetPartitionDesc(rel);
+ partdesc = PartitionDirectoryLookup(estate->es_partition_directory, rel);
pd = (PartitionDispatch) palloc(offsetof(PartitionDispatchData, indexes) +
partdesc->nparts * sizeof(int));
@@ -1533,6 +1542,10 @@ ExecCreatePartitionPruneState(PlanState *planstate,
ListCell *lc;
int i;
+ if (estate->es_partition_directory == NULL)
+ estate->es_partition_directory =
+ CreatePartitionDirectory(estate->es_query_cxt);
+
n_part_hierarchies = list_length(partitionpruneinfo->prune_infos);
Assert(n_part_hierarchies > 0);
@@ -1612,7 +1625,8 @@ ExecCreatePartitionPruneState(PlanState *planstate,
*/
partrel = ExecGetRangeTableRelation(estate, pinfo->rtindex);
partkey = RelationGetPartitionKey(partrel);
- partdesc = RelationGetPartitionDesc(partrel);
+ partdesc = PartitionDirectoryLookup(estate->es_partition_directory,
+ partrel);
n_steps = list_length(pinfo->pruning_steps);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 566858c19b..b9ecd8d24e 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -2229,7 +2229,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
(operation == CMD_INSERT || update_tuple_routing_needed))
mtstate->mt_partition_tuple_routing =
- ExecSetupPartitionTupleRouting(mtstate, rel);
+ ExecSetupPartitionTupleRouting(estate, mtstate, rel);
/*
* Build state for collecting transition tuples. This requires having a
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index faba493200..04a930d65b 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -124,28 +124,15 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
/*
* The rewriter should already have obtained an appropriate lock on each
- * relation named in the query. However, for each child relation we add
- * to the query, we must obtain an appropriate lock, because this will be
- * the first use of those relations in the parse/rewrite/plan pipeline.
- * Child rels should use the same lockmode as their parent.
+ * relation named in the query, so we can open the parent relation without
+ * locking it. However, for each child relation we add to the query, we
+ * must obtain an appropriate lock, because this will be the first use of
+ * those relations in the parse/rewrite/plan pipeline. Child rels should
+ * use the same lockmode as their parent.
*/
+ oldrelation = table_open(parentOID, NoLock);
lockmode = rte->rellockmode;
- /* Scan for all members of inheritance set, acquire needed locks */
- inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);
-
- /*
- * Check that there's at least one descendant, else treat as no-child
- * case. This could happen despite above has_subclass() check, if table
- * once had a child but no longer does.
- */
- if (list_length(inhOIDs) < 2)
- {
- /* Clear flag before returning */
- rte->inh = false;
- return;
- }
-
/*
* If parent relation is selected FOR UPDATE/SHARE, we need to mark its
* PlanRowMark as isParent = true, and generate a new PlanRowMark for each
@@ -155,21 +142,19 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
if (oldrc)
oldrc->isParent = true;
- /*
- * Must open the parent relation to examine its tupdesc. We need not lock
- * it; we assume the rewriter already did.
- */
- oldrelation = table_open(parentOID, NoLock);
-
/* Scan the inheritance set and expand it */
- if (RelationGetPartitionDesc(oldrelation) != NULL)
+ if (oldrelation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);
+ if (root->partition_directory == NULL)
+ root->partition_directory =
+ CreatePartitionDirectory(CurrentMemoryContext);
+
/*
- * If this table has partitions, recursively expand them in the order
- * in which they appear in the PartitionDesc. While at it, also
- * extract the partition key columns of all the partitioned tables.
+ * If this table has partitions, recursively expand and lock them.
+ * While at it, also extract the partition key columns of all the
+ * partitioned tables.
*/
expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
lockmode, &root->append_rel_list);
@@ -180,6 +165,22 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
RangeTblEntry *childrte;
Index childRTindex;
+ /* Scan for all members of inheritance set, acquire needed locks */
+ inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);
+
+ /*
+ * Check that there's at least one descendant, else treat as no-child
+ * case. This could happen despite above has_subclass() check, if the
+ * table once had a child but no longer does.
+ */
+ if (list_length(inhOIDs) < 2)
+ {
+ /* Clear flag before returning */
+ rte->inh = false;
+ table_close(oldrelation, NoLock);
+ return;
+ }
+
/*
* This table has no partitions. Expand any plain inheritance
* children in the order the OIDs were returned by
@@ -249,7 +250,10 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
int i;
RangeTblEntry *childrte;
Index childRTindex;
- PartitionDesc partdesc = RelationGetPartitionDesc(parentrel);
+ PartitionDesc partdesc;
+
+ partdesc = PartitionDirectoryLookup(root->partition_directory,
+ parentrel);
check_stack_depth();
@@ -289,8 +293,8 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
Oid childOID = partdesc->oids[i];
Relation childrel;
- /* Open rel; we already have required locks */
- childrel = table_open(childOID, NoLock);
+ /* Open rel, acquiring required locks */
+ childrel = table_open(childOID, lockmode);
/*
* Temporary partitions belonging to other sessions should have been
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index eec1e09e35..ac5fbc49ea 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -1904,7 +1904,7 @@ set_relation_partition_info(PlannerInfo *root, RelOptInfo *rel,
Assert(relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
- partdesc = RelationGetPartitionDesc(relation);
+ partdesc = PartitionDirectoryLookup(root->partition_directory, relation);
partkey = RelationGetPartitionKey(relation);
rel->part_scheme = find_partition_scheme(root, relation);
Assert(partdesc != NULL && rel->part_scheme != NULL);
diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c
index 66b1e38527..a207ff35ee 100644
--- a/src/backend/partitioning/partdesc.c
+++ b/src/backend/partitioning/partdesc.c
@@ -21,12 +21,25 @@
#include "storage/sinval.h"
#include "utils/builtins.h"
#include "utils/inval.h"
+#include "utils/hsearch.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
#include "utils/partcache.h"
#include "utils/syscache.h"
+typedef struct PartitionDirectoryData
+{
+ MemoryContext pdir_mcxt;
+ HTAB *pdir_hash;
+} PartitionDirectoryData;
+
+typedef struct PartitionDirectoryEntry
+{
+ Oid reloid;
+ PartitionDesc pd;
+} PartitionDirectoryEntry;
+
/*
* RelationBuildPartitionDesc
* Form rel's partition descriptor
@@ -208,13 +221,62 @@ RelationBuildPartitionDesc(Relation rel)
partdesc->oids[index] = oids[i];
/* Record if the partition is a leaf partition */
partdesc->is_leaf[index] =
- (get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
+ (get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
}
MemoryContextSwitchTo(oldcxt);
rel->rd_partdesc = partdesc;
}
+/*
+ * CreatePartitionDirectory
+ * Create a new partition directory object.
+ */
+PartitionDirectory
+CreatePartitionDirectory(MemoryContext mcxt)
+{
+ MemoryContext oldcontext = MemoryContextSwitchTo(mcxt);
+ PartitionDirectory pdir;
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(HASHCTL));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(PartitionDirectoryEntry);
+ ctl.hcxt = mcxt;
+
+ pdir = palloc(sizeof(PartitionDirectoryData));
+ pdir->pdir_mcxt = mcxt;
+ pdir->pdir_hash = hash_create("partition directory", 256, &ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+ MemoryContextSwitchTo(oldcontext);
+ return pdir;
+}
+
+/*
+ * PartitionDirectoryLookup
+ * Look up the partition descriptor for a relation in the directory.
+ *
+ * The purpose of this function is to ensure that we get the same
+ * PartitionDesc for each relation every time we look it up. In the
+ * face of current DDL, different PartitionDescs may be constructed with
+ * different views of the catalog state, but any single particular OID
+ * will always get the same PartitionDesc for as long as the same
+ * PartitionDirectory is used.
+ */
+PartitionDesc
+PartitionDirectoryLookup(PartitionDirectory pdir, Relation rel)
+{
+ PartitionDirectoryEntry *pde;
+ Oid relid = RelationGetRelid(rel);
+ bool found;
+
+ pde = hash_search(pdir->pdir_hash, &relid, HASH_ENTER, &found);
+ if (!found)
+ pde->pd = RelationGetPartitionDesc(rel);
+ return pde->pd;
+}
+
/*
* equalPartitionDescs
* Compare two partition descriptors for logical equality
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 54a40ef00b..1495b60d11 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2480,6 +2480,26 @@ RelationClearRelation(Relation relation, bool rebuild)
SWAPFIELD(PartitionDesc, rd_partdesc);
SWAPFIELD(MemoryContext, rd_pdcxt);
}
+ else if (rebuild && newrel->rd_pdcxt != NULL)
+ {
+ /*
+ * We are rebuilding a partitioned relation with a non-zero
+ * reference count, so keep the old partition descriptor around,
+ * in case there's a PartitionDirectory with a pointer to it.
+ * Attach it to the new rd_pdcxt so that it gets cleaned up
+ * eventually. In the case where the reference count is 0, this
+ * code is not reached, which should be OK because in that case
+ * there should be no PartitionDirectory with a pointer to the old
+ * entry.
+ *
+ * Note that newrel and relation have already been swapped, so
+ * the "old" partition descriptor is actually the one hanging off
+ * of newrel.
+ */
+ MemoryContextSetParent(newrel->rd_pdcxt, relation->rd_pdcxt);
+ newrel->rd_partdesc = NULL;
+ newrel->rd_pdcxt = NULL;
+ }
#undef SWAPFIELD
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 2048c43c37..b363aba2a5 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -135,7 +135,8 @@ typedef struct PartitionPruneState
PartitionPruningData *partprunedata[FLEXIBLE_ARRAY_MEMBER];
} PartitionPruneState;
-extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
+extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(EState *estate,
+ ModifyTableState *mtstate,
Relation rel);
extern ResultRelInfo *ExecFindPartition(ModifyTableState *mtstate,
ResultRelInfo *rootResultRelInfo,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3b789ee7cf..84de8efeda 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -19,6 +19,7 @@
#include "lib/pairingheap.h"
#include "nodes/params.h"
#include "nodes/plannodes.h"
+#include "partitioning/partdefs.h"
#include "utils/hsearch.h"
#include "utils/queryenvironment.h"
#include "utils/reltrigger.h"
@@ -515,6 +516,7 @@ typedef struct EState
*/
ResultRelInfo *es_root_result_relations; /* array of ResultRelInfos */
int es_num_root_result_relations; /* length of the array */
+ PartitionDirectory es_partition_directory; /* for PartitionDesc lookup */
/*
* The following list contains ResultRelInfos created by the tuple routing
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index d3c477a542..9fc1634c30 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -348,6 +348,9 @@ struct PlannerInfo
/* Does this query modify any partition key columns? */
bool partColsUpdated;
+
+ /* Directory of partition descriptors. */
+ PartitionDirectory partition_directory;
};
diff --git a/src/include/partitioning/partdefs.h b/src/include/partitioning/partdefs.h
index 6e9c128b2c..aec3b3fe63 100644
--- a/src/include/partitioning/partdefs.h
+++ b/src/include/partitioning/partdefs.h
@@ -21,4 +21,6 @@ typedef struct PartitionBoundSpec PartitionBoundSpec;
typedef struct PartitionDescData *PartitionDesc;
+typedef struct PartitionDirectoryData *PartitionDirectory;
+
#endif /* PARTDEFS_H */
diff --git a/src/include/partitioning/partdesc.h b/src/include/partitioning/partdesc.h
index f72b70dded..6e384541da 100644
--- a/src/include/partitioning/partdesc.h
+++ b/src/include/partitioning/partdesc.h
@@ -31,6 +31,9 @@ typedef struct PartitionDescData
extern void RelationBuildPartitionDesc(Relation rel);
+extern PartitionDirectory CreatePartitionDirectory(MemoryContext mcxt);
+extern PartitionDesc PartitionDirectoryLookup(PartitionDirectory, Relation);
+
extern Oid get_default_oid_from_partdesc(PartitionDesc partdesc);
extern bool equalPartitionDescs(PartitionKey key, PartitionDesc partdesc1,
--
2.17.2 (Apple Git-113)
On 2019-Jan-31, Robert Haas wrote:
OK, that seems to be pretty easy. New patch series attached. The
patch with that new logic is 0004. I've consolidated some of the
things I had as separate patches in my last post and rewritten the
commit messages to explain more clearly the purpose of each patch.
Looks awesome.
- For now, I haven't tried to handle the DETACH PARTITION case. I
don't think there's anything preventing someone - possibly even me -
from implementing the counter-based approach that I described in the
previous message, but I think it would be good to have some more
discussion first on whether it's acceptable to make concurrent queries
error out. I think any queries that were already up and running would
be fine, but any that were planned before the DETACH and tried to
execute afterwards would get an ERROR. That's fairly low-probability,
because normally the shared invalidation machinery would cause
replanning, but there's a race condition, so we'd have to document
something like: if you use this feature, it'll probably just work, but
you might get some funny errors in other sessions if you're unlucky.
That kinda sucks but maybe we should just suck it up. Possibly we
should consider making the concurrent behavior optional, so that if
you'd rather take blocking locks than risk errors, you have that
option. Of course I guess you could also just let people do an
explicit LOCK TABLE if that's what they want. Or we could try to
actually make it work in that case, I guess by ignoring the detached
partitions, but that seems a lot harder.
I think telling people to do LOCK TABLE beforehand if they care about
errors is sufficient. On the other hand, I do hope that we're only
going to cause queries to fail if they would affect the partition that's
being detached and not other partitions in the table. Or maybe because
of the replanning on invalidation this doesn't matter as much as I think
it does.
- 0003 doesn't have any handling for parallel query at this point, so
even though within a single backend a single query execution will
always get the same PartitionDesc for the same relation, the answers
might not be consistent across the parallel group.
That doesn't sound good. I think the easiest would be to just serialize
the PartitionDesc and send it to the workers instead of them recomputing
it, but then I worry that this might have bad performance when the
partition desc is large. (Or maybe sending bytes over pqmq is faster
than reading all those catalog entries and so this isn't a concern
anyway.)
- 0003 also changes the order in which locks are acquired. I am not
sure whether we care about this, especially in view of other pending
changes.
Yeah, the drawbacks of the unpredictable locking order are worrisome,
but then the performance gain is hard to dismiss. Not this patch only
but the others too. If we're okay with the others going in, I guess we
don't have concerns about this one either.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jan 31, 2019 at 6:00 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
- 0003 doesn't have any handling for parallel query at this point, so
even though within a single backend a single query execution will
always get the same PartitionDesc for the same relation, the answers
might not be consistent across the parallel group.
That doesn't sound good. I think the easiest would be to just serialize
the PartitionDesc and send it to the workers instead of them recomputing
it, but then I worry that this might have bad performance when the
partition desc is large. (Or maybe sending bytes over pqmq is faster
than reading all those catalog entries and so this isn't a concern
anyway.)
I don't think we'd be using pqmq here, or shm_mq either, but I think
the bigger issue is that starting a parallel query is already a
pretty heavy operation, and so the added overhead of this is probably
not very noticeable. I agree that it seems a bit expensive, but since
we're already waiting for the postmaster to fork() a new process which
then has to initialize itself, this probably won't break the bank.
What bothers me more is that it's adding a substantial amount of code
that could very well contain bugs to fix something that isn't clearly
a problem in the first place.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Feb 1, 2019 at 9:00 AM Robert Haas <robertmhaas@gmail.com> wrote:
I don't think we'd be using pqmq here, or shm_mq either, but I think
the bigger issue is that starting a parallel query is already a
pretty heavy operation, and so the added overhead of this is probably
not very noticeable. I agree that it seems a bit expensive, but since
we're already waiting for the postmaster to fork() a new process which
then has to initialize itself, this probably won't break the bank.
What bothers me more is that it's adding a substantial amount of code
that could very well contain bugs to fix something that isn't clearly
a problem in the first place.
I spent most of the last 6 hours writing and debugging a substantial
chunk of the code that would be needed. Here's an 0006 patch that
adds functions to serialize and restore PartitionDesc in a manner
similar to what parallel query does for other object types. Since a
PartitionDesc includes a pointer to a PartitionBoundInfo, that meant
also writing functions to serialize and restore those. If we want to
go this route, I think the next thing to do would be to integrate this
into the PartitionDirectory infrastructure.
Basically what I'm imagining we would do there is have a hash table
stored in shared memory to go with the one that is already stored in
backend-private memory. The shared table stores serialized entries,
and the local table stores normal ones. Any lookups try the local
table first, then the shared table. If we get a hit in the shared
table, we deserialize whatever we find there and stash the result in
the local table. If we find it in neither place, we generate a new entry
in the local table and then serialize it into the shared table. It's
not quite clear to me at the moment how to solve the concurrency
problems associated with this design, but it's probably not too hard.
I don't have enough mental energy left to figure it out today, though.
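For what it's worth, the local-then-shared fall-through can be sketched in a few lines. This is just a toy model of the idea, with fixed arrays standing in for the two hash tables and strings standing in for PartitionDescs; every name in it is made up, not actual patch code:

```c
#include <assert.h>
#include <string.h>

/* Toy stand-ins for the real structures; all names here are hypothetical. */
#define MAX_ENTRIES 8

typedef struct { unsigned reloid; const char *desc; } LocalEntry;
typedef struct { unsigned reloid; const char *serialized; } SharedEntry;

static LocalEntry local_tab[MAX_ENTRIES];   /* backend-private table */
static SharedEntry shared_tab[MAX_ENTRIES]; /* table in shared memory */
static int nlocal, nshared;

/* Stand-in for deserializing a PartitionDesc from its shared form. */
static const char *deserialize(const char *bytes) { return bytes; }

/* Stand-in for building a fresh PartitionDesc from the catalogs. */
static const char *build_from_catalogs(unsigned reloid)
{
    (void) reloid;
    return "fresh";
}

static const char *
directory_lookup(unsigned reloid)
{
    int i;
    const char *pd;

    /* 1. Try the backend-local table first. */
    for (i = 0; i < nlocal; i++)
        if (local_tab[i].reloid == reloid)
            return local_tab[i].desc;

    /* 2. Fall back to the shared table; deserialize and cache locally. */
    for (i = 0; i < nshared; i++)
        if (shared_tab[i].reloid == reloid)
        {
            pd = deserialize(shared_tab[i].serialized);
            local_tab[nlocal].reloid = reloid;
            local_tab[nlocal].desc = pd;
            nlocal++;
            return pd;
        }

    /* 3. Found in neither place: build, cache locally, publish to shared. */
    pd = build_from_catalogs(reloid);
    local_tab[nlocal].reloid = reloid;
    local_tab[nlocal].desc = pd;
    nlocal++;
    shared_tab[nshared].reloid = reloid;
    shared_tab[nshared].serialized = pd;
    nshared++;
    return pd;
}
```

The concurrency problem alluded to above lives in step 3: two backends can race to publish the same OID into the shared table, so the real thing would need a lock or an insert-if-absent primitive there.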
After having written this code, I'm still torn about whether to go
further with this design. On the one hand, this is such boilerplate
code that it's kinda hard to imagine it having too many more bugs; on
the other hand, as you can see, it's a non-trivial amount of code to
add without a real clear reason, and I'm not sure we have one, even
though in the abstract it seems like a better way to go.
Still interested in hearing more opinions.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
0006-Serialize-and-restore-ParitionDesc-PartitionBound.patch (application/octet-stream)
From 3e4e99a2cc23d0f03be3486ee6aae44c0d552157 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 1 Feb 2019 13:16:40 -0500
Subject: [PATCH 6/6] Serialize and restore PartitionDesc/PartitionBound.
---
src/backend/partitioning/partbounds.c | 238 ++++++++++++++++++++++++++
src/backend/partitioning/partdesc.c | 74 ++++++++
src/include/partitioning/partbounds.h | 6 +
src/include/partitioning/partdesc.h | 6 +
4 files changed, 324 insertions(+)
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index e71eb3793b..779cbd83f5 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -859,6 +859,244 @@ partition_bounds_copy(PartitionBoundInfo src,
return dest;
}
+/*
+ * Estimate the amount of space required to serialize a PartitionBoundInfo.
+ * As usual, we require the PartitionKey as an additional argument to
+ * properly interpret the stored Datums.
+ */
+Size
+partition_bounds_estimate(PartitionBoundInfo bound, PartitionKey key)
+{
+ Size sz;
+ int i;
+ int datums_per_bound;
+
+ /* Assorted sanity checks on the input data. */
+ Assert(bound->strategy == PARTITION_STRATEGY_RANGE ||
+ bound->strategy == PARTITION_STRATEGY_LIST ||
+ bound->strategy == PARTITION_STRATEGY_HASH);
+ Assert((bound->strategy == PARTITION_STRATEGY_RANGE) ==
+ (bound->kind != NULL));
+ Assert(key->partnatts > 0);
+
+ /* Space for strategy, ndatums, null_index, default_index. */
+ sz = sizeof(char) + 3 * sizeof(int);
+
+ /* Space for kind. */
+ if (bound->strategy == PARTITION_STRATEGY_RANGE)
+ sz = add_size(sz,
+ mul_size(mul_size(bound->ndatums, key->partnatts),
+ sizeof(PartitionRangeDatumKind)));
+
+ /* Space for datums. */
+ if (key->strategy == PARTITION_STRATEGY_HASH)
+ datums_per_bound = 2; /* modulus and remainder */
+ else
+ datums_per_bound = key->partnatts;
+ for (i = 0; i < bound->ndatums; i++)
+ {
+ int j;
+
+ for (j = 0; j < datums_per_bound; j++)
+ {
+ bool byval;
+ int typlen;
+
+ if (bound->strategy == PARTITION_STRATEGY_RANGE &&
+ bound->kind[i][j] != PARTITION_RANGE_DATUM_VALUE)
+ continue;
+
+ if (bound->strategy == PARTITION_STRATEGY_HASH)
+ {
+ typlen = sizeof(int32); /* Always int4 */
+ byval = true; /* int4 is pass-by-value */
+ }
+ else
+ {
+ byval = key->parttypbyval[j];
+ typlen = key->parttyplen[j];
+ }
+
+ sz = add_size(sz,
+ datumEstimateSpace(bound->datums[i][j], false,
+ byval, typlen));
+ }
+ }
+
+ /* Space for indexes. */
+ sz = add_size(sz,
+ mul_size(get_partition_bound_num_indexes(bound),
+ sizeof(int)));
+
+ return sz;
+}
+
+/*
+ * Serialize a PartitionBoundInfo; the PartitionKey is required for context
+ * both now and when restoring from the serialized state. Note that the caller
+ * must ensure that enough space is available. To find out how much space will
+ * be needed, call partition_bounds_estimate().
+ */
+void
+partition_bounds_serialize(PartitionBoundInfo bound, PartitionKey key,
+ char *start_address)
+{
+ Size indexbytes;
+ int i;
+ int datums_per_bound;
+
+ /*
+ * Copy fixed-width fields, remembering that start_address may not be
+ * aligned.
+ */
+ memcpy(start_address, &bound->strategy, sizeof(char));
+ start_address += sizeof(char);
+ memcpy(start_address, &bound->ndatums, sizeof(int));
+ start_address += sizeof(int);
+ memcpy(start_address, &bound->null_index, sizeof(int));
+ start_address += sizeof(int);
+ memcpy(start_address, &bound->default_index, sizeof(int));
+ start_address += sizeof(int);
+
+ /* Copy kind, if applicable. */
+ if (key->strategy == PARTITION_STRATEGY_RANGE)
+ {
+ for (i = 0; i < bound->ndatums; ++i)
+ {
+ Size bytes;
+
+ bytes = sizeof(PartitionRangeDatumKind) * key->partnatts;
+ memcpy(start_address, bound->kind[i], bytes);
+ start_address += bytes;
+ }
+ }
+
+ /* Space for datums. */
+ if (key->strategy == PARTITION_STRATEGY_HASH)
+ datums_per_bound = 2; /* modulus and remainder */
+ else
+ datums_per_bound = key->partnatts;
+ for (i = 0; i < bound->ndatums; i++)
+ {
+ int j;
+
+ for (j = 0; j < datums_per_bound; j++)
+ {
+ bool byval;
+ int typlen;
+
+ if (bound->strategy == PARTITION_STRATEGY_RANGE &&
+ bound->kind[i][j] != PARTITION_RANGE_DATUM_VALUE)
+ continue;
+
+ if (bound->strategy == PARTITION_STRATEGY_HASH)
+ {
+ typlen = sizeof(int32); /* Always int4 */
+ byval = true; /* int4 is pass-by-value */
+ }
+ else
+ {
+ byval = key->parttypbyval[j];
+ typlen = key->parttyplen[j];
+ }
+
+ datumSerialize(bound->datums[i][j], false, byval, typlen,
+ &start_address);
+ }
+ }
+
+ /* Copy indexes. */
+ indexbytes = sizeof(int) * get_partition_bound_num_indexes(bound);
+ memcpy(start_address, bound->indexes, indexbytes);
+ start_address += indexbytes;
+}
+
+/*
+ * Restore a previously-serialized PartitionBoundInfo.
+ *
+ * The result is allocated in CurrentMemoryContext. *start_address is
+ * incremented based on the number of bytes consumed.
+ */
+PartitionBoundInfo
+partition_bounds_restore(char **start_address, PartitionKey key)
+{
+ PartitionBoundInfo bound;
+ int datums_per_bound;
+ int i;
+ Size indexbytes;
+
+ bound = palloc0(sizeof(PartitionBoundInfoData));
+
+ /*
+ * Restore fixed-width fields, remembering that start_address may not be
+ * aligned.
+ */
+ memcpy(&bound->strategy, *start_address, sizeof(char));
+ *start_address += sizeof(char);
+ memcpy(&bound->ndatums, *start_address, sizeof(int));
+ *start_address += sizeof(int);
+ memcpy(&bound->null_index, *start_address, sizeof(int));
+ *start_address += sizeof(int);
+ memcpy(&bound->default_index, *start_address, sizeof(int));
+ *start_address += sizeof(int);
+
+ /* Restore kind, if applicable. */
+ if (key->strategy == PARTITION_STRATEGY_RANGE)
+ {
+ bound->kind =
+ palloc(bound->ndatums * sizeof(PartitionRangeDatumKind *));
+
+ for (i = 0; i < bound->ndatums; ++i)
+ {
+ Size bytes;
+
+ bytes = sizeof(PartitionRangeDatumKind) * key->partnatts;
+ bound->kind[i] = palloc(bytes);
+ memcpy(bound->kind[i], *start_address, bytes);
+ *start_address += bytes;
+ }
+ }
+
+ /* Restore datums; note that we must have 'kind' already to do this. */
+ if (key->strategy == PARTITION_STRATEGY_HASH)
+ datums_per_bound = 2; /* modulus and remainder */
+ else
+ datums_per_bound = key->partnatts;
+ bound->datums = palloc(sizeof(Datum *) * bound->ndatums);
+ for (i = 0; i < bound->ndatums; i++)
+ {
+ int j;
+
+ bound->datums[i] = palloc0(sizeof(Datum) * datums_per_bound);
+ for (j = 0; j < datums_per_bound; j++)
+ {
+ bool isnull;
+
+ if (bound->strategy == PARTITION_STRATEGY_RANGE &&
+ bound->kind[i][j] != PARTITION_RANGE_DATUM_VALUE)
+ continue;
+
+ /* datumRestore may palloc */
+ bound->datums[i][j] = datumRestore(start_address, &isnull);
+ Assert(!isnull);
+ }
+ }
+
+ /*
+ * Restore indexes.
+ *
+ * Note that we're calling get_partition_bound_num_indexes on the bound
+ * object even though it isn't complete yet. Fortunately it doesn't
+ * depend on the array we're about to restore, so that's OK.
+ */
+ indexbytes = sizeof(int) * get_partition_bound_num_indexes(bound);
+ bound->indexes = palloc(indexbytes);
+ memcpy(bound->indexes, *start_address, indexbytes);
+ *start_address += indexbytes;
+
+ return bound;
+}
+
/*
* check_new_partition_bound
*
diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c
index a207ff35ee..84e24646a4 100644
--- a/src/backend/partitioning/partdesc.c
+++ b/src/backend/partitioning/partdesc.c
@@ -331,6 +331,80 @@ equalPartitionDescs(PartitionKey key, PartitionDesc partdesc1,
return true;
}
+/*
+ * Estimate the amount of space required to serialize a PartitionDesc.
+ */
+Size
+EstimatePartitionDesc(PartitionDesc partdesc, PartitionKey key)
+{
+ Size sz = sizeof(int); /* for nparts */
+
+ /* Space for oids and is_leaf. */
+ sz = add_size(sz, mul_size(partdesc->nparts, sizeof(Oid) + sizeof(bool)));
+
+ /* Space for boundinfo, if required. */
+ if (partdesc->nparts > 0)
+ sz = add_size(sz, partition_bounds_estimate(partdesc->boundinfo, key));
+
+ return sz;
+}
+
+/*
+ * Serialize a PartitionDesc. The caller must use EstimatePartitionDesc to
+ * determine how much space will be needed and pass a sufficiently-large
+ * buffer to this function.
+ */
+void
+SerializePartitionDesc(PartitionDesc partdesc, PartitionKey key,
+ char *start_address)
+{
+ Size oids_bytes = sizeof(Oid) * partdesc->nparts;
+ Size is_leaf_bytes = sizeof(bool) * partdesc->nparts;
+
+ memcpy(start_address, &partdesc->nparts, sizeof(int));
+ start_address += sizeof(int);
+ if (partdesc->nparts > 0)
+ {
+ memcpy(start_address, partdesc->oids, oids_bytes);
+ start_address += oids_bytes;
+ memcpy(start_address, partdesc->is_leaf, is_leaf_bytes);
+ start_address += is_leaf_bytes;
+ partition_bounds_serialize(partdesc->boundinfo, key, start_address);
+ }
+}
+
+/*
+ * Restore a serialized PartitionDesc into CurrentMemoryContext, advancing
+ * *start_address based on the number of bytes consumed.
+ */
+PartitionDesc
+RestorePartitionDesc(char **start_address, PartitionKey key)
+{
+ PartitionDesc partdesc;
+
+ partdesc = palloc0(sizeof(PartitionDescData));
+ memcpy(&partdesc->nparts, *start_address, sizeof(int));
+ *start_address += sizeof(int);
+
+ if (partdesc->nparts > 0)
+ {
+ Size oids_bytes = partdesc->nparts * sizeof(Oid);
+ Size is_leaf_bytes = partdesc->nparts * sizeof(bool);
+
+ partdesc->oids = palloc(oids_bytes);
+ memcpy(partdesc->oids, *start_address, oids_bytes);
+ *start_address += oids_bytes;
+
+ partdesc->is_leaf = palloc(is_leaf_bytes);
+ memcpy(partdesc->is_leaf, *start_address, is_leaf_bytes);
+ *start_address += is_leaf_bytes;
+
+ partdesc->boundinfo = partition_bounds_restore(start_address, key);
+ }
+
+ return partdesc;
+}
+
/*
* get_default_oid_from_partdesc
*
diff --git a/src/include/partitioning/partbounds.h b/src/include/partitioning/partbounds.h
index b1ae39ad63..0ac6daf319 100644
--- a/src/include/partitioning/partbounds.h
+++ b/src/include/partitioning/partbounds.h
@@ -87,6 +87,12 @@ extern bool partition_bounds_equal(int partnatts, int16 *parttyplen,
PartitionBoundInfo b2);
extern PartitionBoundInfo partition_bounds_copy(PartitionBoundInfo src,
PartitionKey key);
+extern Size partition_bounds_estimate(PartitionBoundInfo bound,
+ PartitionKey key);
+extern void partition_bounds_serialize(PartitionBoundInfo bound,
+ PartitionKey key, char *start_address);
+extern PartitionBoundInfo partition_bounds_restore(char **start_address,
+ PartitionKey key);
extern void check_new_partition_bound(char *relname, Relation parent,
PartitionBoundSpec *spec);
extern void check_default_partition_contents(Relation parent,
diff --git a/src/include/partitioning/partdesc.h b/src/include/partitioning/partdesc.h
index 6e384541da..1e679083d5 100644
--- a/src/include/partitioning/partdesc.h
+++ b/src/include/partitioning/partdesc.h
@@ -39,4 +39,10 @@ extern Oid get_default_oid_from_partdesc(PartitionDesc partdesc);
extern bool equalPartitionDescs(PartitionKey key, PartitionDesc partdesc1,
PartitionDesc partdesc2);
+extern Size EstimatePartitionDesc(PartitionDesc partdesc, PartitionKey key);
+extern void SerializePartitionDesc(PartitionDesc partdesc, PartitionKey key,
+ char *start_address);
+extern PartitionDesc RestorePartitionDesc(char **start_address,
+ PartitionKey key);
+
#endif /* PARTCACHE_H */
--
2.17.2 (Apple Git-113)
On Sat, 2 Feb 2019 at 09:31, Robert Haas <robertmhaas@gmail.com> wrote:
After having written this code, I'm still torn about whether to go
further with this design. On the one hand, this is such boilerplate
code that it's kinda hard to imagine it having too many more bugs; on
the other hand, as you can see, it's a non-trivial amount of code to
add without a real clear reason, and I'm not sure we have one, even
though in the abstract it seems like a better way to go.
I think we do need to ensure that the PartitionDesc matches between
worker and leader. Have a look at choose_next_subplan_for_worker() in
nodeAppend.c. Notice that a call is made to
ExecFindMatchingSubPlans().
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Feb 2, 2019 at 7:19 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
I think we do need to ensure that the PartitionDesc matches between
worker and leader. Have a look at choose_next_subplan_for_worker() in
nodeAppend.c. Notice that a call is made to
ExecFindMatchingSubPlans().
Thanks for the tip. I see that code, but I'm not sure that I
understand why it matters here. First, if I'm not mistaken, what's
being returned by ExecFindMatchingSubPlans is a BitmapSet of subplan
indexes, not anything that refers to a PartitionDesc directly. And
second, even if it did, it looks like the computation is done
separately in every backend and not shared among backends, so even if
it were directly referring to PartitionDesc indexes, it still won't be
assuming that they're the same in every backend. Can you further
explain your thinking?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, 4 Feb 2019 at 16:45, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Feb 2, 2019 at 7:19 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
I think we do need to ensure that the PartitionDesc matches between
worker and leader. Have a look at choose_next_subplan_for_worker() in
nodeAppend.c. Notice that a call is made to
ExecFindMatchingSubPlans().
Thanks for the tip. I see that code, but I'm not sure that I
understand why it matters here. First, if I'm not mistaken, what's
being returned by ExecFindMatchingSubPlans is a BitmapSet of subplan
indexes, not anything that refers to a PartitionDesc directly. And
second, even if it did, it looks like the computation is done
separately in every backend and not shared among backends, so even if
it were directly referring to PartitionDesc indexes, it still wouldn't be
assuming that they're the same in every backend. Can you further
explain your thinking?
In a Parallel Append, each parallel worker will call ExecInitAppend(),
which calls ExecCreatePartitionPruneState(). That function makes a
call to RelationGetPartitionDesc() and records the partdesc's
boundinfo in context->boundinfo. This means that if we perform any
pruning in the parallel worker in choose_next_subplan_for_worker()
then find_matching_subplans_recurse() will use the PartitionDesc from
the parallel worker to translate the partition indexes into the
Append's subnodes.
If the PartitionDesc from the parallel worker has an extra partition
compared to what was there when the plan was built, then the partition index
to subplan index translation will be incorrect as the
find_matching_subplans_recurse() will call get_matching_partitions()
using the context with the PartitionDesc containing the additional
partition. The return value from get_matching_partitions() is fine,
it's just that the code inside the while ((i =
bms_next_member(partset, i)) >= 0) loop will do the wrong thing.
It could even crash if partset has an index out of bounds of the
subplan_map or subpart_map arrays.
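To make the hazard concrete, here is a toy sketch of the translation step: pruning hands back partition indexes that are meaningful only against the PartitionDesc the worker built, while subplan_map was sized for the PartitionDesc the planner saw. None of these names are the real executor structures; the real code works with a Bitmapset rather than an array:

```c
#include <assert.h>

#define NO_SUBPLAN (-1)

/*
 * Translate matched partition indexes into subplan indexes.  A defensive
 * version must skip any partition index that falls outside the planned
 * map (a partition added after the plan was built); without the bounds
 * check, subplan_map[i] reads past the end of the array, which is the
 * possible crash described above.  Returns the number of subplans found.
 */
static int
translate_partitions(const int *partset, int npartset,
                     const int *subplan_map, int map_len,
                     int *subplans_out)
{
    int n = 0;
    int k;

    for (k = 0; k < npartset; k++)
    {
        int i = partset[k];

        if (i >= map_len)
            continue;           /* concurrently added partition: no entry */
        if (subplan_map[i] != NO_SUBPLAN)
            subplans_out[n++] = subplan_map[i];
    }
    return n;
}
```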
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Feb 4, 2019 at 12:02 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:
If the PartitionDesc from the parallel worker has an extra partition
compared to what was there when the plan was built, then the partition index
to subplan index translation will be incorrect as the
find_matching_subplans_recurse() will call get_matching_partitions()
using the context with the PartitionDesc containing the additional
partition. The return value from get_matching_partitions() is fine,
it's just that the code inside the while ((i =
bms_next_member(partset, i)) >= 0) loop will do the wrong thing.
It could even crash if partset has an index out of bounds of the
subplan_map or subpart_map arrays.
Is there any chance you've missed the fact that in one of the later
patches in the series I added code to adjust the subplan_map and
subpart_map arrays to compensate for any extra partitions?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Feb 4, 2019 at 12:54 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Feb 4, 2019 at 12:02 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:
If the PartitionDesc from the parallel worker has an extra partition
compared to what was there when the plan was built, then the partition index
to subplan index translation will be incorrect as the
find_matching_subplans_recurse() will call get_matching_partitions()
using the context with the PartitionDesc containing the additional
partition. The return value from get_matching_partitions() is fine,
it's just that the code inside the while ((i =
bms_next_member(partset, i)) >= 0) loop that will do the wrong thing.
It could even crash if partset has an index out of bounds of the
subplan_map or subpart_map arrays.
Is there any chance you've missed the fact that in one of the later
patches in the series I added code to adjust the subplan_map and
subpart_map arrays to compensate for any extra partitions?
In case that wasn't clear enough, my point here is that while the
leader and workers could end up with different ideas about the shape
of the PartitionDesc, each would end up with a subplan_map and
subpart_map array adapted to the view of the PartitionDesc with which
they ended up, and therefore, I think, everything should work. So far
there is, to my knowledge, no situation in which a PartitionDesc index
gets passed between one backend and another, and as long as we don't
do that, it's not really necessary for them to agree; each backend
needs to individually ignore any concurrently added partitions not
contemplated by the plan, but it doesn't matter whether backend A and
backend B agree on which partitions were concurrently added, just that
each ignores the ones it knows about.
Since time is rolling along here, I went ahead and committed 0001
which seems harmless even if somebody finds a huge problem with some
other part of this. If anybody wants to review the approach or the
code before I proceed further, that would be great, but please speak
up soon.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, 5 Feb 2019 at 01:54, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Feb 4, 2019 at 12:02 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:
If the PartitionDesc from the parallel worker has an extra partition
compared to what was there when the plan was built, then the partition index
to subplan index translation will be incorrect as the
find_matching_subplans_recurse() will call get_matching_partitions()
using the context with the PartitionDesc containing the additional
partition. The return value from get_matching_partitions() is fine,
it's just that the code inside the while ((i =
bms_next_member(partset, i)) >= 0) loop that will do the wrong thing.
It could even crash if partset has an index out of bounds of the
subplan_map or subpart_map arrays.
Is there any chance you've missed the fact that in one of the later
patches in the series I added code to adjust the subplan_map and
subpart_map arrays to compensate for any extra partitions?
I admit that I hadn't looked at the patch; I was just going on what I
had read here. I wasn't sure how the re-map would have been done, as
some of the information is unavailable during execution, but I see now
that you've modified it so we send a list of Oids that we expect and
remap based on whether an unexpected Oid is found.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Dec 21, 2018 at 6:04 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
On Fri, 21 Dec 2018 at 09:43, Robert Haas <robertmhaas@gmail.com> wrote:
- I refactored expand_inherited_rtentry() to drive partition expansion
entirely off of PartitionDescs. The reason why this is necessary is
that it clearly will not work to have find_all_inheritors() use a
current snapshot to decide what children we have and lock them, and
then consult a different source of truth to decide which relations to
open with NoLock. There's nothing to keep the lists of partitions
from being different in the two cases, and that demonstrably causes
assertion failures if you SELECT with an ATTACH/DETACH loop running in
the background. However, it also changes the order in which tables get
locked. Possibly that could be fixed by teaching
expand_partitioned_rtentry() to qsort() the OIDs the way
find_inheritance_children() does. It also loses the infinite-loop
protection which find_all_inheritors() has. Not sure what to do about
that.
I don't think you need to qsort() the Oids before locking. What the
qsort() does today is ensure we get a consistent locking order. Any
other order would surely do, providing we stick to it consistently. I
think PartitionDesc order is fine, as it's consistent. Having it
locked in PartitionDesc order I think is what's needed for [1] anyway.
[2] proposes to relax the locking order taken during execution.
[1] https://commitfest.postgresql.org/21/1778/
[2] https://commitfest.postgresql.org/21/1887/
Based on this feedback, I went ahead and committed the part of the
previously-posted patch set that makes this change.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jan 31, 2019 at 1:02 PM Robert Haas <robertmhaas@gmail.com> wrote:
New patch series attached.
And here's yet another new patch series, rebased over today's commit
and with a couple of other fixes:
1. I realized that the PartitionDirectory for the planner ought to be
attached to the PlannerGlobal, not the PlannerInfo; we don't want to
create more than one partition directory per query planning cycle, and
we do want our notion of the PartitionDesc for a given relation to be
stable between the outer query and any subqueries.
2. I discovered - via CLOBBER_CACHE_ALWAYS testing - that the
PartitionDirectory has to hold a reference count on the relcache
entry. In hindsight, this should have been obvious: the planner keeps
the locks when it closes a relation and later reopens it, but it
doesn't keep the relation open, which is what prevents recycling of
the old PartitionDesc. Unfortunately these additional reference count
manipulations are probably not free. I don't know how expensive they are,
though; maybe it's not too bad.
Aside from these problems, I think I have spotted a subtle problem in
0001. I'll think about that some more and post another update.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
v3-0002-Ensure-that-repeated-PartitionDesc-lookups-return.patch (application/octet-stream)
From 7765ed870f34b0b2b6f0911d4f3ca6bfa129a0fe Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 28 Nov 2018 10:15:55 -0500
Subject: [PATCH v3 2/4] Ensure that repeated PartitionDesc lookups return the
same answer.
The query planner will get confused if looking up the PartitionDesc
for a particular relation does not return a consistent answer for
the entire duration of query planning. Likewise, query execution
will get confused if the same relation seems to have a different
PartitionDesc at different times. Invent a new PartitionDirectory
concept and use it to ensure consistency.
Note that this only ensures consistency within a single query
planning cycle or a single query execution. It doesn't guarantee
that the answer can't change between planning and execution, nor
does it change the way a PartitionDesc is constructed in the first
place.
Since this allows pointers to old PartitionDesc entries to survive
even after a relcache rebuild, also postpone removing the old
PartitionDesc entry until we're certain no one is using it.
---
src/backend/commands/copy.c | 2 +-
src/backend/executor/execPartition.c | 28 ++++++--
src/backend/executor/execUtils.c | 8 +++
src/backend/executor/nodeModifyTable.c | 2 +-
src/backend/optimizer/plan/planner.c | 4 ++
src/backend/optimizer/util/inherit.c | 9 ++-
src/backend/optimizer/util/plancat.c | 3 +-
src/backend/partitioning/partdesc.c | 91 +++++++++++++++++++++++++-
src/backend/utils/cache/relcache.c | 20 ++++++
src/include/executor/execPartition.h | 3 +-
src/include/nodes/execnodes.h | 2 +
src/include/nodes/pathnodes.h | 2 +
src/include/partitioning/partdefs.h | 2 +
src/include/partitioning/partdesc.h | 4 ++
14 files changed, 167 insertions(+), 13 deletions(-)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index dbb06397e6..382c966a6e 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2559,7 +2559,7 @@ CopyFrom(CopyState cstate)
* CopyFrom tuple routing.
*/
if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
- proute = ExecSetupPartitionTupleRouting(NULL, cstate->rel);
+ proute = ExecSetupPartitionTupleRouting(estate, NULL, cstate->rel);
if (cstate->whereClause)
cstate->qualexpr = ExecInitQual(castNode(List, cstate->whereClause),
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index e121c6c8ff..db133b37a5 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -167,7 +167,8 @@ static void ExecInitRoutingInfo(ModifyTableState *mtstate,
PartitionDispatch dispatch,
ResultRelInfo *partRelInfo,
int partidx);
-static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
+static PartitionDispatch ExecInitPartitionDispatchInfo(EState *estate,
+ PartitionTupleRouting *proute,
Oid partoid, PartitionDispatch parent_pd, int partidx);
static void FormPartitionKeyDatum(PartitionDispatch pd,
TupleTableSlot *slot,
@@ -201,7 +202,8 @@ static void find_matching_subplans_recurse(PartitionPruningData *prunedata,
* it should be estate->es_query_cxt.
*/
PartitionTupleRouting *
-ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
+ExecSetupPartitionTupleRouting(EState *estate, ModifyTableState *mtstate,
+ Relation rel)
{
PartitionTupleRouting *proute;
ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
@@ -223,7 +225,8 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
* parent as NULL as we don't need to care about any parent of the target
* partitioned table.
*/
- ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL, 0);
+ ExecInitPartitionDispatchInfo(estate, proute, RelationGetRelid(rel),
+ NULL, 0);
/*
* If performing an UPDATE with tuple routing, we can reuse partition
@@ -424,7 +427,8 @@ ExecFindPartition(ModifyTableState *mtstate,
* Create the new PartitionDispatch. We pass the current one
* in as the parent PartitionDispatch
*/
- subdispatch = ExecInitPartitionDispatchInfo(proute,
+ subdispatch = ExecInitPartitionDispatchInfo(mtstate->ps.state,
+ proute,
partdesc->oids[partidx],
dispatch, partidx);
Assert(dispatch->indexes[partidx] >= 0 &&
@@ -964,7 +968,8 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
* PartitionDispatch later.
*/
static PartitionDispatch
-ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
+ExecInitPartitionDispatchInfo(EState *estate,
+ PartitionTupleRouting *proute, Oid partoid,
PartitionDispatch parent_pd, int partidx)
{
Relation rel;
@@ -973,6 +978,10 @@ ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
int dispatchidx;
MemoryContext oldcxt;
+ if (estate->es_partition_directory == NULL)
+ estate->es_partition_directory =
+ CreatePartitionDirectory(estate->es_query_cxt);
+
oldcxt = MemoryContextSwitchTo(proute->memcxt);
/*
@@ -984,7 +993,7 @@ ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
rel = table_open(partoid, RowExclusiveLock);
else
rel = proute->partition_root;
- partdesc = RelationGetPartitionDesc(rel);
+ partdesc = PartitionDirectoryLookup(estate->es_partition_directory, rel);
pd = (PartitionDispatch) palloc(offsetof(PartitionDispatchData, indexes) +
partdesc->nparts * sizeof(int));
@@ -1530,6 +1539,10 @@ ExecCreatePartitionPruneState(PlanState *planstate,
ListCell *lc;
int i;
+ if (estate->es_partition_directory == NULL)
+ estate->es_partition_directory =
+ CreatePartitionDirectory(estate->es_query_cxt);
+
n_part_hierarchies = list_length(partitionpruneinfo->prune_infos);
Assert(n_part_hierarchies > 0);
@@ -1609,7 +1622,8 @@ ExecCreatePartitionPruneState(PlanState *planstate,
*/
partrel = ExecGetRangeTableRelation(estate, pinfo->rtindex);
partkey = RelationGetPartitionKey(partrel);
- partdesc = RelationGetPartitionDesc(partrel);
+ partdesc = PartitionDirectoryLookup(estate->es_partition_directory,
+ partrel);
n_steps = list_length(pinfo->pruning_steps);
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 312a0dc805..d19a747788 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -54,6 +54,7 @@
#include "mb/pg_wchar.h"
#include "nodes/nodeFuncs.h"
#include "parser/parsetree.h"
+#include "partitioning/partdesc.h"
#include "storage/lmgr.h"
#include "utils/builtins.h"
#include "utils/memutils.h"
@@ -218,6 +219,13 @@ FreeExecutorState(EState *estate)
estate->es_jit = NULL;
}
+ /* release partition directory, if allocated */
+ if (estate->es_partition_directory)
+ {
+ DestroyPartitionDirectory(estate->es_partition_directory);
+ estate->es_partition_directory = NULL;
+ }
+
/*
* Free the per-query memory context, thereby releasing all working
* memory, including the EState node itself.
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 566858c19b..b9ecd8d24e 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -2229,7 +2229,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
(operation == CMD_INSERT || update_tuple_routing_needed))
mtstate->mt_partition_tuple_routing =
- ExecSetupPartitionTupleRouting(mtstate, rel);
+ ExecSetupPartitionTupleRouting(estate, mtstate, rel);
/*
* Build state for collecting transition tuples. This requires having a
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index bc81535905..98dd5281ad 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -56,6 +56,7 @@
#include "parser/analyze.h"
#include "parser/parsetree.h"
#include "parser/parse_agg.h"
+#include "partitioning/partdesc.h"
#include "rewrite/rewriteManip.h"
#include "storage/dsm_impl.h"
#include "utils/rel.h"
@@ -567,6 +568,9 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
result->jitFlags |= PGJIT_DEFORM;
}
+ if (glob->partition_directory != NULL)
+ DestroyPartitionDirectory(glob->partition_directory);
+
return result;
}
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index a014a12060..1fa154e0cb 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -147,6 +147,10 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
{
Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);
+ if (root->glob->partition_directory == NULL)
+ root->glob->partition_directory =
+ CreatePartitionDirectory(CurrentMemoryContext);
+
/*
* If this table has partitions, recursively expand and lock them.
* While at it, also extract the partition key columns of all the
@@ -246,7 +250,10 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
int i;
RangeTblEntry *childrte;
Index childRTindex;
- PartitionDesc partdesc = RelationGetPartitionDesc(parentrel);
+ PartitionDesc partdesc;
+
+ partdesc = PartitionDirectoryLookup(root->glob->partition_directory,
+ parentrel);
check_stack_depth();
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 78a96b4ee2..30f4dc151b 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -2086,7 +2086,8 @@ set_relation_partition_info(PlannerInfo *root, RelOptInfo *rel,
Assert(relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
- partdesc = RelationGetPartitionDesc(relation);
+ partdesc = PartitionDirectoryLookup(root->glob->partition_directory,
+ relation);
partkey = RelationGetPartitionKey(relation);
rel->part_scheme = find_partition_scheme(root, relation);
Assert(partdesc != NULL && rel->part_scheme != NULL);
diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c
index 66b1e38527..dbbca84dcb 100644
--- a/src/backend/partitioning/partdesc.c
+++ b/src/backend/partitioning/partdesc.c
@@ -21,12 +21,26 @@
#include "storage/sinval.h"
#include "utils/builtins.h"
#include "utils/inval.h"
+#include "utils/hsearch.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
#include "utils/partcache.h"
#include "utils/syscache.h"
+typedef struct PartitionDirectoryData
+{
+ MemoryContext pdir_mcxt;
+ HTAB *pdir_hash;
+} PartitionDirectoryData;
+
+typedef struct PartitionDirectoryEntry
+{
+ Oid reloid;
+ Relation rel;
+ PartitionDesc pd;
+} PartitionDirectoryEntry;
+
/*
* RelationBuildPartitionDesc
* Form rel's partition descriptor
@@ -208,13 +222,88 @@ RelationBuildPartitionDesc(Relation rel)
partdesc->oids[index] = oids[i];
/* Record if the partition is a leaf partition */
partdesc->is_leaf[index] =
- (get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
+ (get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
}
MemoryContextSwitchTo(oldcxt);
rel->rd_partdesc = partdesc;
}
+/*
+ * CreatePartitionDirectory
+ * Create a new partition directory object.
+ */
+PartitionDirectory
+CreatePartitionDirectory(MemoryContext mcxt)
+{
+ MemoryContext oldcontext = MemoryContextSwitchTo(mcxt);
+ PartitionDirectory pdir;
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(HASHCTL));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(PartitionDirectoryEntry);
+ ctl.hcxt = mcxt;
+
+ pdir = palloc(sizeof(PartitionDirectoryData));
+ pdir->pdir_mcxt = mcxt;
+ pdir->pdir_hash = hash_create("partition directory", 256, &ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+ MemoryContextSwitchTo(oldcontext);
+ return pdir;
+}
+
+/*
+ * PartitionDirectoryLookup
+ * Look up the partition descriptor for a relation in the directory.
+ *
+ * The purpose of this function is to ensure that we get the same
+ * PartitionDesc for each relation every time we look it up. In the
+ * face of concurrent DDL, different PartitionDescs may be constructed with
+ * different views of the catalog state, but any single particular OID
+ * will always get the same PartitionDesc for as long as the same
+ * PartitionDirectory is used.
+ */
+PartitionDesc
+PartitionDirectoryLookup(PartitionDirectory pdir, Relation rel)
+{
+ PartitionDirectoryEntry *pde;
+ Oid relid = RelationGetRelid(rel);
+ bool found;
+
+ pde = hash_search(pdir->pdir_hash, &relid, HASH_ENTER, &found);
+ if (!found)
+ {
+ /*
+ * We must keep a reference count on the relation so that the
+ * PartitionDesc to which we are pointing can't get destroyed.
+ */
+ RelationIncrementReferenceCount(rel);
+ pde->rel = rel;
+ pde->pd = RelationGetPartitionDesc(rel);
+ Assert(pde->pd != NULL);
+ }
+ return pde->pd;
+}
+
+/*
+ * DestroyPartitionDirectory
+ * Destroy a partition directory.
+ *
+ * Release the reference counts we're holding.
+ */
+void
+DestroyPartitionDirectory(PartitionDirectory pdir)
+{
+ HASH_SEQ_STATUS status;
+ PartitionDirectoryEntry *pde;
+
+ hash_seq_init(&status, pdir->pdir_hash);
+ while ((pde = hash_seq_search(&status)) != NULL)
+ RelationDecrementReferenceCount(pde->rel);
+}
+
/*
* equalPartitionDescs
* Compare two partition descriptors for logical equality
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 54a40ef00b..1495b60d11 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2480,6 +2480,26 @@ RelationClearRelation(Relation relation, bool rebuild)
SWAPFIELD(PartitionDesc, rd_partdesc);
SWAPFIELD(MemoryContext, rd_pdcxt);
}
+ else if (rebuild && newrel->rd_pdcxt != NULL)
+ {
+ /*
+ * We are rebuilding a partitioned relation with a non-zero
+ * reference count, so keep the old partition descriptor around,
+ * in case there's a PartitionDirectory with a pointer to it.
+ * Attach it to the new rd_pdcxt so that it gets cleaned up
+ * eventually. In the case where the reference count is 0, this
+ * code is not reached, which should be OK because in that case
+ * there should be no PartitionDirectory with a pointer to the old
+ * entry.
+ *
+ * Note that newrel and relation have already been swapped, so
+ * the "old" partition descriptor is actually the one hanging off
+ * of newrel.
+ */
+ MemoryContextSetParent(newrel->rd_pdcxt, relation->rd_pdcxt);
+ newrel->rd_partdesc = NULL;
+ newrel->rd_pdcxt = NULL;
+ }
#undef SWAPFIELD
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 2048c43c37..b363aba2a5 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -135,7 +135,8 @@ typedef struct PartitionPruneState
PartitionPruningData *partprunedata[FLEXIBLE_ARRAY_MEMBER];
} PartitionPruneState;
-extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
+extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(EState *estate,
+ ModifyTableState *mtstate,
Relation rel);
extern ResultRelInfo *ExecFindPartition(ModifyTableState *mtstate,
ResultRelInfo *rootResultRelInfo,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3b789ee7cf..84de8efeda 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -19,6 +19,7 @@
#include "lib/pairingheap.h"
#include "nodes/params.h"
#include "nodes/plannodes.h"
+#include "partitioning/partdefs.h"
#include "utils/hsearch.h"
#include "utils/queryenvironment.h"
#include "utils/reltrigger.h"
@@ -515,6 +516,7 @@ typedef struct EState
*/
ResultRelInfo *es_root_result_relations; /* array of ResultRelInfos */
int es_num_root_result_relations; /* length of the array */
+ PartitionDirectory es_partition_directory; /* for PartitionDesc lookup */
/*
* The following list contains ResultRelInfos created by the tuple routing
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index a008ae07da..7b2cbdbefc 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -144,6 +144,8 @@ typedef struct PlannerGlobal
bool parallelModeNeeded; /* parallel mode actually required? */
char maxParallelHazard; /* worst PROPARALLEL hazard level */
+
+ PartitionDirectory partition_directory; /* partition descriptors */
} PlannerGlobal;
/* macro for fetching the Plan associated with a SubPlan node */
diff --git a/src/include/partitioning/partdefs.h b/src/include/partitioning/partdefs.h
index 6e9c128b2c..aec3b3fe63 100644
--- a/src/include/partitioning/partdefs.h
+++ b/src/include/partitioning/partdefs.h
@@ -21,4 +21,6 @@ typedef struct PartitionBoundSpec PartitionBoundSpec;
typedef struct PartitionDescData *PartitionDesc;
+typedef struct PartitionDirectoryData *PartitionDirectory;
+
#endif /* PARTDEFS_H */
diff --git a/src/include/partitioning/partdesc.h b/src/include/partitioning/partdesc.h
index f72b70dded..da19369e25 100644
--- a/src/include/partitioning/partdesc.h
+++ b/src/include/partitioning/partdesc.h
@@ -31,6 +31,10 @@ typedef struct PartitionDescData
extern void RelationBuildPartitionDesc(Relation rel);
+extern PartitionDirectory CreatePartitionDirectory(MemoryContext mcxt);
+extern PartitionDesc PartitionDirectoryLookup(PartitionDirectory, Relation);
+extern void DestroyPartitionDirectory(PartitionDirectory pdir);
+
extern Oid get_default_oid_from_partdesc(PartitionDesc partdesc);
extern bool equalPartitionDescs(PartitionKey key, PartitionDesc partdesc1,
--
2.17.2 (Apple Git-113)
v3-0004-Reduce-the-lock-level-required-to-attach-a-partit.patch (application/octet-stream)
From 0d377ef247ea4cd7cac9bb6e65f8af303319ea95 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 17 Jan 2019 10:25:12 -0500
Subject: [PATCH v3 4/4] Reduce the lock level required to attach a partition.
Previous work makes this safe (hopefully).
---
src/backend/commands/tablecmds.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 35bdb0e0c6..fa4634e8dc 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -3652,6 +3652,9 @@ AlterTableGetLockLevel(List *cmds)
break;
case AT_AttachPartition:
+ cmd_lockmode = ShareUpdateExclusiveLock;
+ break;
+
case AT_DetachPartition:
cmd_lockmode = AccessExclusiveLock;
break;
--
2.17.2 (Apple Git-113)
v3-0003-Teach-runtime-partition-pruning-to-cope-with-conc.patch (application/octet-stream)
From 509121ddd1d72b49773fde19dc89ae1a1295a6f2 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 17 Jan 2019 09:11:10 -0500
Subject: [PATCH v3 3/4] Teach runtime partition pruning to cope with
concurrent partition adds.
If new partitions were added between plan time and execution time, the
indexes stored in the subplan_map[] and subpart_map[] arrays within
the plan's PartitionedRelPruneInfo would no longer be correct. Adjust
the code to cope with added partitions. There does not seem to be
a simple way to cope with partitions that are removed, mostly because
they could then get added back again with different bounds, so don't
try to cope with that situation.
---
src/backend/executor/execPartition.c | 68 +++++++++++++++++++++++-----
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/partitioning/partprune.c | 7 ++-
src/include/nodes/plannodes.h | 1 +
6 files changed, 66 insertions(+), 13 deletions(-)
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index db133b37a5..de84d03680 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1599,18 +1599,6 @@ ExecCreatePartitionPruneState(PlanState *planstate,
int n_steps;
ListCell *lc3;
- /*
- * We must copy the subplan_map rather than pointing directly to
- * the plan's version, as we may end up making modifications to it
- * later.
- */
- pprune->subplan_map = palloc(sizeof(int) * pinfo->nparts);
- memcpy(pprune->subplan_map, pinfo->subplan_map,
- sizeof(int) * pinfo->nparts);
-
- /* We can use the subpart_map verbatim, since we never modify it */
- pprune->subpart_map = pinfo->subpart_map;
-
/* present_parts is also subject to later modification */
pprune->present_parts = bms_copy(pinfo->present_parts);
@@ -1625,6 +1613,62 @@ ExecCreatePartitionPruneState(PlanState *planstate,
partdesc = PartitionDirectoryLookup(estate->es_partition_directory,
partrel);
+ /*
+ * Initialize the subplan_map and subpart_map. Since detaching a
+ * partition requires AccessExclusiveLock, no partitions can have
+ * disappeared, nor can the bounds for any partition have changed.
+ * However, new partitions may have been added.
+ */
+ Assert(partdesc->nparts >= pinfo->nparts);
+ pprune->subplan_map = palloc(sizeof(int) * partdesc->nparts);
+ if (partdesc->nparts == pinfo->nparts)
+ {
+ /*
+ * There are no new partitions, so this is simple. We can
+ * simply point to the subpart_map from the plan, but we must
+ * copy the subplan_map since we may change it later.
+ */
+ pprune->subpart_map = pinfo->subpart_map;
+ memcpy(pprune->subplan_map, pinfo->subplan_map,
+ sizeof(int) * pinfo->nparts);
+
+ /* Double-check that list of relations has not changed. */
+ Assert(memcmp(partdesc->oids, pinfo->relid_map,
+ pinfo->nparts * sizeof(Oid)) == 0);
+ }
+ else
+ {
+ int pd_idx = 0;
+ int pp_idx;
+
+ /*
+ * Some new partitions have appeared since plan time, and
+ * those are reflected in our PartitionDesc but were not
+ * present in the one used to construct subplan_map and
+ * subpart_map. So we must construct new and longer arrays
+ * where the partitions that were originally present map to the
+ * same place, and any added indexes map to -1, as if the
+ * new partitions had been pruned.
+ */
+ pprune->subpart_map = palloc(sizeof(int) * partdesc->nparts);
+ for (pp_idx = 0; pp_idx < partdesc->nparts; ++pp_idx)
+ {
+ if (pinfo->relid_map[pd_idx] != partdesc->oids[pp_idx])
+ {
+ pprune->subplan_map[pp_idx] = -1;
+ pprune->subpart_map[pp_idx] = -1;
+ }
+ else
+ {
+ pprune->subplan_map[pp_idx] =
+ pinfo->subplan_map[pd_idx];
+ pprune->subpart_map[pp_idx] =
+ pinfo->subpart_map[pd_idx++];
+ }
+ }
+ Assert(pd_idx == pinfo->nparts);
+ }
+
n_steps = list_length(pinfo->pruning_steps);
context->strategy = partkey->strategy;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index e15724bb0e..d5fddce953 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -1197,6 +1197,7 @@ _copyPartitionedRelPruneInfo(const PartitionedRelPruneInfo *from)
COPY_SCALAR_FIELD(nexprs);
COPY_POINTER_FIELD(subplan_map, from->nparts * sizeof(int));
COPY_POINTER_FIELD(subpart_map, from->nparts * sizeof(int));
+ COPY_POINTER_FIELD(relid_map, from->nparts * sizeof(Oid));
COPY_POINTER_FIELD(hasexecparam, from->nexprs * sizeof(bool));
COPY_SCALAR_FIELD(do_initial_prune);
COPY_SCALAR_FIELD(do_exec_prune);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 65302fe65b..65b4a63013 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -947,6 +947,7 @@ _outPartitionedRelPruneInfo(StringInfo str, const PartitionedRelPruneInfo *node)
WRITE_INT_FIELD(nexprs);
WRITE_INT_ARRAY(subplan_map, node->nparts);
WRITE_INT_ARRAY(subpart_map, node->nparts);
+ WRITE_OID_ARRAY(relid_map, node->nparts);
WRITE_BOOL_ARRAY(hasexecparam, node->nexprs);
WRITE_BOOL_FIELD(do_initial_prune);
WRITE_BOOL_FIELD(do_exec_prune);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 5aa42242a9..fc60b0a7c5 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2385,6 +2385,7 @@ _readPartitionedRelPruneInfo(void)
READ_INT_FIELD(nexprs);
READ_INT_ARRAY(subplan_map, local_node->nparts);
READ_INT_ARRAY(subpart_map, local_node->nparts);
+ READ_OID_ARRAY(relid_map, local_node->nparts);
READ_BOOL_ARRAY(hasexecparam, local_node->nexprs);
READ_BOOL_FIELD(do_initial_prune);
READ_BOOL_FIELD(do_exec_prune);
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index 8c9721935d..b5c0889935 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -47,8 +47,9 @@
#include "optimizer/appendinfo.h"
#include "optimizer/optimizer.h"
#include "optimizer/pathnode.h"
-#include "partitioning/partprune.h"
+#include "parser/parsetree.h"
#include "partitioning/partbounds.h"
+#include "partitioning/partprune.h"
#include "rewrite/rewriteManip.h"
#include "utils/lsyscache.h"
@@ -359,6 +360,7 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
int partnatts = subpart->part_scheme->partnatts;
int *subplan_map;
int *subpart_map;
+ Oid *relid_map;
List *partprunequal;
List *pruning_steps;
bool contradictory;
@@ -434,6 +436,7 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
*/
subplan_map = (int *) palloc(nparts * sizeof(int));
subpart_map = (int *) palloc(nparts * sizeof(int));
+ relid_map = (Oid *) palloc(nparts * sizeof(Oid));
present_parts = NULL;
for (i = 0; i < nparts; i++)
@@ -444,6 +447,7 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
subplan_map[i] = subplanidx;
subpart_map[i] = subpartidx;
+ relid_map[i] = planner_rt_fetch(partrel->relid, root)->relid;
if (subplanidx >= 0)
{
present_parts = bms_add_member(present_parts, i);
@@ -462,6 +466,7 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
pinfo->nparts = nparts;
pinfo->subplan_map = subplan_map;
pinfo->subpart_map = subpart_map;
+ pinfo->relid_map = relid_map;
/* Determine which pruning types should be enabled at this level */
doruntimeprune |= analyze_partkey_exprs(pinfo, pruning_steps,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 6d087c268f..d66a187a53 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1108,6 +1108,7 @@ typedef struct PartitionedRelPruneInfo
int nexprs; /* Length of hasexecparam[] */
int *subplan_map; /* subplan index by partition index, or -1 */
int *subpart_map; /* subpart index by partition index, or -1 */
+ Oid *relid_map; /* relation OID by partition index, or 0 */
bool *hasexecparam; /* true if corresponding pruning_step contains
* any PARAM_EXEC Params. */
bool do_initial_prune; /* true if pruning should be performed
--
2.17.2 (Apple Git-113)
v3-0001-Ensure-that-RelationBuildPartitionDesc-sees-a-con.patch
From 69849a018ff08be5a2bce8daba8b1a15c0b890e9 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Nov 2018 12:15:44 -0500
Subject: [PATCH v3 1/4] Ensure that RelationBuildPartitionDesc sees a
consistent view.
If partitions are added or removed concurrently, make sure that we
nevertheless get a view of the partition list and the partition
descriptor for each partition which is consistent with the system
state at some single point in the commit history.
To do this, reuse an idea first invented by Noah Misch back in
commit 4240e429d0c2d889d0cda23c618f94e12c13ade7.
Nothing in this commit permits partitions to be added or removed
concurrently; it just allows RelationBuildPartitionDesc to produce
reasonable results if they do. It also does not guarantee that
the results produced by RelationBuildPartitionDesc will be stable
from one call to the next; it only tries to make sure that they
will be sane.
---
src/backend/partitioning/partdesc.c | 137 ++++++++++++++++++++--------
1 file changed, 101 insertions(+), 36 deletions(-)
diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c
index 8a4b63aa26..66b1e38527 100644
--- a/src/backend/partitioning/partdesc.c
+++ b/src/backend/partitioning/partdesc.c
@@ -18,7 +18,9 @@
#include "catalog/pg_inherits.h"
#include "partitioning/partbounds.h"
#include "partitioning/partdesc.h"
+#include "storage/sinval.h"
#include "utils/builtins.h"
+#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -47,45 +49,113 @@ RelationBuildPartitionDesc(Relation rel)
MemoryContext oldcxt;
int *mapping;
- /* Get partition oids from pg_inherits */
- inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
- nparts = list_length(inhoids);
-
- if (nparts > 0)
+ /*
+ * Fetch catalog information. Since we want to allow partitions to be
+ * added and removed without holding AccessExclusiveLock on the parent
+ * table, it's possible that the catalog contents could be changing under
+ * us. That means that by the time we fetch the partition bound for a
+ * partition returned by find_inheritance_children, it might no longer be
+ * a partition or might even be a partition of some other table.
+ *
+ * To ensure that we get a consistent view of the catalog data, we first
+ * fetch everything we need and then call AcceptInvalidationMessages. If
+ * SharedInvalidMessageCounter advances between the time we start fetching
+ * information and the time AcceptInvalidationMessages() completes, that
+ * means something may have changed under us, so we start over and do it
+ * all again.
+ */
+ for (;;)
{
- oids = palloc(nparts * sizeof(Oid));
- boundspecs = palloc(nparts * sizeof(PartitionBoundSpec *));
+ uint64 inval_count = SharedInvalidMessageCounter;
+
+ /* Get partition oids from pg_inherits */
+ inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
+ nparts = list_length(inhoids);
+
+ if (nparts > 0)
+ {
+ oids = palloc(nparts * sizeof(Oid));
+ boundspecs = palloc(nparts * sizeof(PartitionBoundSpec *));
+ }
+
+ /* Collect bound spec nodes for each partition */
+ i = 0;
+ foreach(cell, inhoids)
+ {
+ Oid inhrelid = lfirst_oid(cell);
+ HeapTuple tuple;
+ PartitionBoundSpec *boundspec = NULL;
+
+ /*
+ * Don't put any sanity checks here that might fail as a result of
+ * concurrent DDL, such as a check that relpartbound is not NULL.
+ * We could transiently see such states as a result of concurrent
+ * DDL. Such checks can be performed only after we're sure we got
+ * a consistent view of the underlying data.
+ */
+ tuple = SearchSysCache1(RELOID, inhrelid);
+ if (HeapTupleIsValid(tuple))
+ {
+ Datum datum;
+ bool isnull;
+
+ datum = SysCacheGetAttr(RELOID, tuple,
+ Anum_pg_class_relpartbound,
+ &isnull);
+ if (!isnull)
+ boundspec = stringToNode(TextDatumGetCString(datum));
+ ReleaseSysCache(tuple);
+ }
+
+ oids[i] = inhrelid;
+ boundspecs[i] = boundspec;
+ ++i;
+ }
+
+ /*
+ * If no relevant catalog changes have occurred (see comments at the
+ * top of this loop), then we got a consistent view of our partition
+ * list and can stop now.
+ */
+ AcceptInvalidationMessages();
+ if (inval_count == SharedInvalidMessageCounter)
+ break;
+
+ /* Something changed, so retry from the top. */
+ if (oids != NULL)
+ {
+ pfree(oids);
+ oids = NULL;
+ }
+ if (boundspecs != NULL)
+ {
+ pfree(boundspecs);
+ boundspecs = NULL;
+ }
+ if (inhoids != NIL)
+ list_free(inhoids);
}
- /* Collect bound spec nodes for each partition */
- i = 0;
- foreach(cell, inhoids)
+ /*
+ * At this point, we should have a consistent view of the data we got from
+ * pg_inherits and pg_class, so it's safe to perform some sanity checks.
+ */
+ for (i = 0; i < nparts; ++i)
{
- Oid inhrelid = lfirst_oid(cell);
- HeapTuple tuple;
- Datum datum;
- bool isnull;
- PartitionBoundSpec *boundspec;
-
- tuple = SearchSysCache1(RELOID, inhrelid);
- if (!HeapTupleIsValid(tuple))
- elog(ERROR, "cache lookup failed for relation %u", inhrelid);
-
- datum = SysCacheGetAttr(RELOID, tuple,
- Anum_pg_class_relpartbound,
- &isnull);
- if (isnull)
- elog(ERROR, "null relpartbound for relation %u", inhrelid);
- boundspec = stringToNode(TextDatumGetCString(datum));
- if (!IsA(boundspec, PartitionBoundSpec))
+ Oid inhrelid = oids[i];
+ PartitionBoundSpec *spec = boundspecs[i];
+
+ if (!spec)
+ elog(ERROR, "missing relpartbound for relation %u", inhrelid);
+ if (!IsA(spec, PartitionBoundSpec))
elog(ERROR, "invalid relpartbound for relation %u", inhrelid);
/*
- * Sanity check: If the PartitionBoundSpec says this is the default
- * partition, its OID should correspond to whatever's stored in
- * pg_partitioned_table.partdefid; if not, the catalog is corrupt.
+ * If the PartitionBoundSpec says this is the default partition, its
+ * OID should match pg_partitioned_table.partdefid; if not, the
+ * catalog is corrupt.
*/
- if (boundspec->is_default)
+ if (spec->is_default)
{
Oid partdefid;
@@ -94,11 +164,6 @@ RelationBuildPartitionDesc(Relation rel)
elog(ERROR, "expected partdefid %u, but got %u",
inhrelid, partdefid);
}
-
- oids[i] = inhrelid;
- boundspecs[i] = boundspec;
- ++i;
- ReleaseSysCache(tuple);
}
/* Now build the actual relcache partition descriptor */
--
2.17.2 (Apple Git-113)
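The retry scheme in the v3-0001 patch above (fetch everything first, then recheck the shared invalidation counter and start over if it moved) is a form of optimistic read. As a rough, self-contained toy model of that pattern, not actual PostgreSQL code: the names `change_counter`, `catalog_attach`, and `snapshot_partitions` are invented stand-ins for SharedInvalidMessageCounter, a concurrent ATTACH, and the catalog-fetching loop in RelationBuildPartitionDesc.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for SharedInvalidMessageCounter: bumped on every "DDL". */
static unsigned long change_counter = 0;

/* A toy catalog: the list of partition ids for one parent table. */
static int catalog[8];
static int catalog_len = 0;

static void catalog_attach(int id)
{
    catalog[catalog_len++] = id;
    change_counter++;            /* the "invalidation message" */
}

/*
 * Optimistic read: copy the catalog, then check whether the counter
 * moved while we were reading.  If it did, something may have changed
 * under us, so throw the copy away and start over.
 */
static int snapshot_partitions(int *out)
{
    for (;;)
    {
        unsigned long start = change_counter;
        int n = catalog_len;

        memcpy(out, catalog, n * sizeof(int));

        /* The real code calls AcceptInvalidationMessages() here. */
        if (start == change_counter)
            return n;            /* consistent view obtained */
    }
}
```

As the follow-up email notes, this recheck alone is not sufficient in PostgreSQL, because a committing backend leaves the ProcArray slightly before it queues its invalidation messages; the toy model only illustrates the counter-recheck idea itself.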
On Tue, Feb 26, 2019 at 5:10 PM Robert Haas <robertmhaas@gmail.com> wrote:
Aside from these problems, I think I have spotted a subtle problem in
0001. I'll think about that some more and post another update.
0001 turned out to be guarding against the wrong problem. It supposed
that if we didn't get a coherent view of the system catalogs due to
concurrent DDL, we could just AcceptInvalidationMessages() and retry.
But that turns out to be wrong, because there's a (very) narrow window
after a process removes itself from the ProcArray and before it sends
invalidation messages. It wasn't difficult to engineer an alternative
solution that works, but unfortunately it's only good enough to handle
the ATTACH case, so this is another thing that will need more thought
for concurrent DETACH. Anyway, the updated 0001 contains that code
and some explanatory comments. The rest of the series is
substantially unchanged.
I'm not currently aware of any remaining correctness issues with this
code, although certainly there may be some. There has been a certain
dearth of volunteers to review any of this. I do plan to poke at it a
bit to see whether it has any significant performance impact, but not
today.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
v4-0004-Reduce-the-lock-level-required-to-attach-a-partit.patch
From a7d2b28e4ee251b28bc5552cbcb082683fdcd884 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 17 Jan 2019 10:25:12 -0500
Subject: [PATCH v4 4/4] Reduce the lock level required to attach a partition.
Previous work makes this safe (hopefully).
---
src/backend/commands/tablecmds.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index a93b13c2fe..d50519dcf0 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -3652,6 +3652,9 @@ AlterTableGetLockLevel(List *cmds)
break;
case AT_AttachPartition:
+ cmd_lockmode = ShareUpdateExclusiveLock;
+ break;
+
case AT_DetachPartition:
cmd_lockmode = AccessExclusiveLock;
break;
--
2.17.2 (Apple Git-113)
v4-0002-Ensure-that-repeated-PartitionDesc-lookups-return.patch
From 9200d45bf656aa006be8d896971802ec2c2f4dff Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 28 Nov 2018 10:15:55 -0500
Subject: [PATCH v4 2/4] Ensure that repeated PartitionDesc lookups return the
same answer.
The query planner will get confused if looking up the PartitionDesc
for a particular relation does not return a consistent answer for
the entire duration of query planning. Likewise, query execution
will get confused if the same relation seems to have a different
PartitionDesc at different times. Invent a new PartitionDirectory
concept and use it to ensure consistency.
Note that this only ensures consistency within a single query
planning cycle or a single query execution. It doesn't guarantee
that the answer can't change between planning and execution, nor
does it change the way a PartitionDesc is constructed in the first
place.
Since this allows pointers to old PartitionDesc entries to survive
even after a relcache rebuild, also postpone removing the old
PartitionDesc entry until we're certain no one is using it.
---
src/backend/commands/copy.c | 2 +-
src/backend/executor/execPartition.c | 28 ++++++--
src/backend/executor/execUtils.c | 8 +++
src/backend/executor/nodeModifyTable.c | 2 +-
src/backend/optimizer/plan/planner.c | 4 ++
src/backend/optimizer/util/inherit.c | 9 ++-
src/backend/optimizer/util/plancat.c | 3 +-
src/backend/partitioning/partdesc.c | 91 +++++++++++++++++++++++++-
src/backend/utils/cache/relcache.c | 20 ++++++
src/include/executor/execPartition.h | 3 +-
src/include/nodes/execnodes.h | 2 +
src/include/nodes/pathnodes.h | 2 +
src/include/partitioning/partdefs.h | 2 +
src/include/partitioning/partdesc.h | 4 ++
14 files changed, 167 insertions(+), 13 deletions(-)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 5dd6fe02c6..12415b4e99 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2556,7 +2556,7 @@ CopyFrom(CopyState cstate)
* CopyFrom tuple routing.
*/
if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
- proute = ExecSetupPartitionTupleRouting(NULL, cstate->rel);
+ proute = ExecSetupPartitionTupleRouting(estate, NULL, cstate->rel);
if (cstate->whereClause)
cstate->qualexpr = ExecInitQual(castNode(List, cstate->whereClause),
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index e121c6c8ff..db133b37a5 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -167,7 +167,8 @@ static void ExecInitRoutingInfo(ModifyTableState *mtstate,
PartitionDispatch dispatch,
ResultRelInfo *partRelInfo,
int partidx);
-static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
+static PartitionDispatch ExecInitPartitionDispatchInfo(EState *estate,
+ PartitionTupleRouting *proute,
Oid partoid, PartitionDispatch parent_pd, int partidx);
static void FormPartitionKeyDatum(PartitionDispatch pd,
TupleTableSlot *slot,
@@ -201,7 +202,8 @@ static void find_matching_subplans_recurse(PartitionPruningData *prunedata,
* it should be estate->es_query_cxt.
*/
PartitionTupleRouting *
-ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
+ExecSetupPartitionTupleRouting(EState *estate, ModifyTableState *mtstate,
+ Relation rel)
{
PartitionTupleRouting *proute;
ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
@@ -223,7 +225,8 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
* parent as NULL as we don't need to care about any parent of the target
* partitioned table.
*/
- ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL, 0);
+ ExecInitPartitionDispatchInfo(estate, proute, RelationGetRelid(rel),
+ NULL, 0);
/*
* If performing an UPDATE with tuple routing, we can reuse partition
@@ -424,7 +427,8 @@ ExecFindPartition(ModifyTableState *mtstate,
* Create the new PartitionDispatch. We pass the current one
* in as the parent PartitionDispatch
*/
- subdispatch = ExecInitPartitionDispatchInfo(proute,
+ subdispatch = ExecInitPartitionDispatchInfo(mtstate->ps.state,
+ proute,
partdesc->oids[partidx],
dispatch, partidx);
Assert(dispatch->indexes[partidx] >= 0 &&
@@ -964,7 +968,8 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
* PartitionDispatch later.
*/
static PartitionDispatch
-ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
+ExecInitPartitionDispatchInfo(EState *estate,
+ PartitionTupleRouting *proute, Oid partoid,
PartitionDispatch parent_pd, int partidx)
{
Relation rel;
@@ -973,6 +978,10 @@ ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
int dispatchidx;
MemoryContext oldcxt;
+ if (estate->es_partition_directory == NULL)
+ estate->es_partition_directory =
+ CreatePartitionDirectory(estate->es_query_cxt);
+
oldcxt = MemoryContextSwitchTo(proute->memcxt);
/*
@@ -984,7 +993,7 @@ ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
rel = table_open(partoid, RowExclusiveLock);
else
rel = proute->partition_root;
- partdesc = RelationGetPartitionDesc(rel);
+ partdesc = PartitionDirectoryLookup(estate->es_partition_directory, rel);
pd = (PartitionDispatch) palloc(offsetof(PartitionDispatchData, indexes) +
partdesc->nparts * sizeof(int));
@@ -1530,6 +1539,10 @@ ExecCreatePartitionPruneState(PlanState *planstate,
ListCell *lc;
int i;
+ if (estate->es_partition_directory == NULL)
+ estate->es_partition_directory =
+ CreatePartitionDirectory(estate->es_query_cxt);
+
n_part_hierarchies = list_length(partitionpruneinfo->prune_infos);
Assert(n_part_hierarchies > 0);
@@ -1609,7 +1622,8 @@ ExecCreatePartitionPruneState(PlanState *planstate,
*/
partrel = ExecGetRangeTableRelation(estate, pinfo->rtindex);
partkey = RelationGetPartitionKey(partrel);
- partdesc = RelationGetPartitionDesc(partrel);
+ partdesc = PartitionDirectoryLookup(estate->es_partition_directory,
+ partrel);
n_steps = list_length(pinfo->pruning_steps);
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 5136269348..6661b8908a 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -54,6 +54,7 @@
#include "mb/pg_wchar.h"
#include "nodes/nodeFuncs.h"
#include "parser/parsetree.h"
+#include "partitioning/partdesc.h"
#include "storage/lmgr.h"
#include "utils/builtins.h"
#include "utils/memutils.h"
@@ -215,6 +216,13 @@ FreeExecutorState(EState *estate)
estate->es_jit = NULL;
}
+ /* release partition directory, if allocated */
+ if (estate->es_partition_directory)
+ {
+ DestroyPartitionDirectory(estate->es_partition_directory);
+ estate->es_partition_directory = NULL;
+ }
+
/*
* Free the per-query memory context, thereby releasing all working
* memory, including the EState node itself.
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 76175aaa6b..f87b22e2c9 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -2194,7 +2194,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
(operation == CMD_INSERT || update_tuple_routing_needed))
mtstate->mt_partition_tuple_routing =
- ExecSetupPartitionTupleRouting(mtstate, rel);
+ ExecSetupPartitionTupleRouting(estate, mtstate, rel);
/*
* Build state for collecting transition tuples. This requires having a
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index bc81535905..98dd5281ad 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -56,6 +56,7 @@
#include "parser/analyze.h"
#include "parser/parsetree.h"
#include "parser/parse_agg.h"
+#include "partitioning/partdesc.h"
#include "rewrite/rewriteManip.h"
#include "storage/dsm_impl.h"
#include "utils/rel.h"
@@ -567,6 +568,9 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
result->jitFlags |= PGJIT_DEFORM;
}
+ if (glob->partition_directory != NULL)
+ DestroyPartitionDirectory(glob->partition_directory);
+
return result;
}
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index a014a12060..1fa154e0cb 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -147,6 +147,10 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
{
Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);
+ if (root->glob->partition_directory == NULL)
+ root->glob->partition_directory =
+ CreatePartitionDirectory(CurrentMemoryContext);
+
/*
* If this table has partitions, recursively expand and lock them.
* While at it, also extract the partition key columns of all the
@@ -246,7 +250,10 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
int i;
RangeTblEntry *childrte;
Index childRTindex;
- PartitionDesc partdesc = RelationGetPartitionDesc(parentrel);
+ PartitionDesc partdesc;
+
+ partdesc = PartitionDirectoryLookup(root->glob->partition_directory,
+ parentrel);
check_stack_depth();
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 78a96b4ee2..30f4dc151b 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -2086,7 +2086,8 @@ set_relation_partition_info(PlannerInfo *root, RelOptInfo *rel,
Assert(relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
- partdesc = RelationGetPartitionDesc(relation);
+ partdesc = PartitionDirectoryLookup(root->glob->partition_directory,
+ relation);
partkey = RelationGetPartitionKey(relation);
rel->part_scheme = find_partition_scheme(root, relation);
Assert(partdesc != NULL && rel->part_scheme != NULL);
diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c
index e89d773261..a4494aca7a 100644
--- a/src/backend/partitioning/partdesc.c
+++ b/src/backend/partitioning/partdesc.c
@@ -27,12 +27,26 @@
#include "utils/builtins.h"
#include "utils/inval.h"
#include "utils/fmgroids.h"
+#include "utils/hsearch.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
#include "utils/partcache.h"
#include "utils/syscache.h"
+typedef struct PartitionDirectoryData
+{
+ MemoryContext pdir_mcxt;
+ HTAB *pdir_hash;
+} PartitionDirectoryData;
+
+typedef struct PartitionDirectoryEntry
+{
+ Oid reloid;
+ Relation rel;
+ PartitionDesc pd;
+} PartitionDirectoryEntry;
+
/*
* RelationBuildPartitionDesc
* Form rel's partition descriptor
@@ -201,13 +215,88 @@ RelationBuildPartitionDesc(Relation rel)
partdesc->oids[index] = oids[i];
/* Record if the partition is a leaf partition */
partdesc->is_leaf[index] =
- (get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
+ (get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
}
MemoryContextSwitchTo(oldcxt);
rel->rd_partdesc = partdesc;
}
+/*
+ * CreatePartitionDirectory
+ * Create a new partition directory object.
+ */
+PartitionDirectory
+CreatePartitionDirectory(MemoryContext mcxt)
+{
+ MemoryContext oldcontext = MemoryContextSwitchTo(mcxt);
+ PartitionDirectory pdir;
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(HASHCTL));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(PartitionDirectoryEntry);
+ ctl.hcxt = mcxt;
+
+ pdir = palloc(sizeof(PartitionDirectoryData));
+ pdir->pdir_mcxt = mcxt;
+ pdir->pdir_hash = hash_create("partition directory", 256, &ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+ MemoryContextSwitchTo(oldcontext);
+ return pdir;
+}
+
+/*
+ * PartitionDirectoryLookup
+ * Look up the partition descriptor for a relation in the directory.
+ *
+ * The purpose of this function is to ensure that we get the same
+ * PartitionDesc for each relation every time we look it up. In the
+ * face of concurrent DDL, different PartitionDescs may be constructed with
+ * different views of the catalog state, but any single particular OID
+ * will always get the same PartitionDesc for as long as the same
+ * PartitionDirectory is used.
+ */
+PartitionDesc
+PartitionDirectoryLookup(PartitionDirectory pdir, Relation rel)
+{
+ PartitionDirectoryEntry *pde;
+ Oid relid = RelationGetRelid(rel);
+ bool found;
+
+ pde = hash_search(pdir->pdir_hash, &relid, HASH_ENTER, &found);
+ if (!found)
+ {
+ /*
+ * We must keep a reference count on the relation so that the
+ * PartitionDesc to which we are pointing can't get destroyed.
+ */
+ RelationIncrementReferenceCount(rel);
+ pde->rel = rel;
+ pde->pd = RelationGetPartitionDesc(rel);
+ Assert(pde->pd != NULL);
+ }
+ return pde->pd;
+}
+
+/*
+ * DestroyPartitionDirectory
+ * Destroy a partition directory.
+ *
+ * Release the reference counts we're holding.
+ */
+void
+DestroyPartitionDirectory(PartitionDirectory pdir)
+{
+ HASH_SEQ_STATUS status;
+ PartitionDirectoryEntry *pde;
+
+ hash_seq_init(&status, pdir->pdir_hash);
+ while ((pde = hash_seq_search(&status)) != NULL)
+ RelationDecrementReferenceCount(pde->rel);
+}
+
/*
* equalPartitionDescs
* Compare two partition descriptors for logical equality
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 54a40ef00b..1495b60d11 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2480,6 +2480,26 @@ RelationClearRelation(Relation relation, bool rebuild)
SWAPFIELD(PartitionDesc, rd_partdesc);
SWAPFIELD(MemoryContext, rd_pdcxt);
}
+ else if (rebuild && newrel->rd_pdcxt != NULL)
+ {
+ /*
+ * We are rebuilding a partitioned relation with a non-zero
+ * reference count, so keep the old partition descriptor around,
+ * in case there's a PartitionDirectory with a pointer to it.
+ * Attach it to the new rd_pdcxt so that it gets cleaned up
+ * eventually. In the case where the reference count is 0, this
+ * code is not reached, which should be OK because in that case
+ * there should be no PartitionDirectory with a pointer to the old
+ * entry.
+ *
+ * Note that newrel and relation have already been swapped, so
+ * the "old" partition descriptor is actually the one hanging off
+ * of newrel.
+ */
+ MemoryContextSetParent(newrel->rd_pdcxt, relation->rd_pdcxt);
+ newrel->rd_partdesc = NULL;
+ newrel->rd_pdcxt = NULL;
+ }
#undef SWAPFIELD
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 2048c43c37..b363aba2a5 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -135,7 +135,8 @@ typedef struct PartitionPruneState
PartitionPruningData *partprunedata[FLEXIBLE_ARRAY_MEMBER];
} PartitionPruneState;
-extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
+extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(EState *estate,
+ ModifyTableState *mtstate,
Relation rel);
extern ResultRelInfo *ExecFindPartition(ModifyTableState *mtstate,
ResultRelInfo *rootResultRelInfo,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 09f8217c80..22e739d642 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -19,6 +19,7 @@
#include "lib/pairingheap.h"
#include "nodes/params.h"
#include "nodes/plannodes.h"
+#include "partitioning/partdefs.h"
#include "utils/hsearch.h"
#include "utils/queryenvironment.h"
#include "utils/reltrigger.h"
@@ -520,6 +521,7 @@ typedef struct EState
*/
ResultRelInfo *es_root_result_relations; /* array of ResultRelInfos */
int es_num_root_result_relations; /* length of the array */
+ PartitionDirectory es_partition_directory; /* for PartitionDesc lookup */
/*
* The following list contains ResultRelInfos created by the tuple routing
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index a008ae07da..7b2cbdbefc 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -144,6 +144,8 @@ typedef struct PlannerGlobal
bool parallelModeNeeded; /* parallel mode actually required? */
char maxParallelHazard; /* worst PROPARALLEL hazard level */
+
+ PartitionDirectory partition_directory; /* partition descriptors */
} PlannerGlobal;
/* macro for fetching the Plan associated with a SubPlan node */
diff --git a/src/include/partitioning/partdefs.h b/src/include/partitioning/partdefs.h
index 6e9c128b2c..aec3b3fe63 100644
--- a/src/include/partitioning/partdefs.h
+++ b/src/include/partitioning/partdefs.h
@@ -21,4 +21,6 @@ typedef struct PartitionBoundSpec PartitionBoundSpec;
typedef struct PartitionDescData *PartitionDesc;
+typedef struct PartitionDirectoryData *PartitionDirectory;
+
#endif /* PARTDEFS_H */
diff --git a/src/include/partitioning/partdesc.h b/src/include/partitioning/partdesc.h
index f72b70dded..da19369e25 100644
--- a/src/include/partitioning/partdesc.h
+++ b/src/include/partitioning/partdesc.h
@@ -31,6 +31,10 @@ typedef struct PartitionDescData
extern void RelationBuildPartitionDesc(Relation rel);
+extern PartitionDirectory CreatePartitionDirectory(MemoryContext mcxt);
+extern PartitionDesc PartitionDirectoryLookup(PartitionDirectory, Relation);
+extern void DestroyPartitionDirectory(PartitionDirectory pdir);
+
extern Oid get_default_oid_from_partdesc(PartitionDesc partdesc);
extern bool equalPartitionDescs(PartitionKey key, PartitionDesc partdesc1,
--
2.17.2 (Apple Git-113)
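The PartitionDirectory introduced in the v4-0002 patch above is essentially a per-query memo table keyed by relation OID: the first lookup caches whatever descriptor is current, and every later lookup for the same OID returns that cached pointer, so planner and executor each see one stable answer. A rough self-contained sketch of that idea, with invented names and a flat array where the real code uses a dynahash table and relcache reference counts:

```c
#include <assert.h>

typedef unsigned int Oid;              /* stand-in for PostgreSQL's Oid */
typedef struct PartDesc { int version; } PartDesc;

/* Pretend descriptor builder: each call may see a different catalog
 * state, which is exactly why the directory caches its first answer. */
static int builds = 0;
static PartDesc pool[16];

static PartDesc *build_descriptor(Oid relid)
{
    PartDesc *d = &pool[builds];

    (void) relid;
    d->version = builds++;             /* catalog "changed" between builds */
    return d;
}

/* A tiny fixed-size directory; the real code uses a hash table and
 * pins the relcache entry so the cached descriptor cannot be freed. */
typedef struct
{
    Oid keys[16];
    PartDesc *vals[16];
    int n;
} Directory;

static PartDesc *directory_lookup(Directory *dir, Oid relid)
{
    for (int i = 0; i < dir->n; i++)
        if (dir->keys[i] == relid)
            return dir->vals[i];       /* same answer as last time */

    dir->keys[dir->n] = relid;
    dir->vals[dir->n] = build_descriptor(relid);
    return dir->vals[dir->n++];
}
```

The design choice here matches the commit message: consistency is only per directory (per planning cycle or per execution), and nothing stops the answer from changing between two different directories.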
v4-0001-Teach-RelationBuildPartitionDesc-to-cope-with-con.patch
From 4e760e497711d0e145500c77334b4ef643762dba Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Nov 2018 12:15:44 -0500
Subject: [PATCH v4 1/4] Teach RelationBuildPartitionDesc to cope with
concurrent ATTACH.
If a partition is added concurrently, we might see it in the list of
children even though the system cache doesn't know about the updated
pg_class value yet; add logic to handle that case.
This does not guarantee that the results produced by
RelationBuildPartitionDesc will be stable from one call to the next;
it only tries to make sure that they will be sane.
---
src/backend/partitioning/partdesc.c | 94 +++++++++++++++++++++++------
1 file changed, 76 insertions(+), 18 deletions(-)
diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c
index 8a4b63aa26..e89d773261 100644
--- a/src/backend/partitioning/partdesc.c
+++ b/src/backend/partitioning/partdesc.c
@@ -14,11 +14,19 @@
#include "postgres.h"
+#include "access/genam.h"
+#include "access/htup_details.h"
+#include "access/table.h"
+#include "catalog/indexing.h"
#include "catalog/partition.h"
#include "catalog/pg_inherits.h"
#include "partitioning/partbounds.h"
#include "partitioning/partdesc.h"
+#include "storage/bufmgr.h"
+#include "storage/sinval.h"
#include "utils/builtins.h"
+#include "utils/inval.h"
+#include "utils/fmgroids.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -47,43 +55,93 @@ RelationBuildPartitionDesc(Relation rel)
MemoryContext oldcxt;
int *mapping;
- /* Get partition oids from pg_inherits */
+ /*
+ * Get partition oids from pg_inherits. This uses a single snapshot to
+ * fetch the list of children, so while more children may be getting
+ * added concurrently, whatever this function returns will be accurate
+ * as of some well-defined point in time.
+ */
inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
nparts = list_length(inhoids);
+ /* Allocate arrays for OIDs and boundspecs. */
if (nparts > 0)
{
oids = palloc(nparts * sizeof(Oid));
boundspecs = palloc(nparts * sizeof(PartitionBoundSpec *));
}
- /* Collect bound spec nodes for each partition */
+ /* Collect bound spec nodes for each partition. */
i = 0;
foreach(cell, inhoids)
{
Oid inhrelid = lfirst_oid(cell);
HeapTuple tuple;
- Datum datum;
- bool isnull;
- PartitionBoundSpec *boundspec;
+ PartitionBoundSpec *boundspec = NULL;
+ /* Try fetching the tuple from the catcache, for speed. */
tuple = SearchSysCache1(RELOID, inhrelid);
- if (!HeapTupleIsValid(tuple))
- elog(ERROR, "cache lookup failed for relation %u", inhrelid);
-
- datum = SysCacheGetAttr(RELOID, tuple,
- Anum_pg_class_relpartbound,
- &isnull);
- if (isnull)
- elog(ERROR, "null relpartbound for relation %u", inhrelid);
- boundspec = stringToNode(TextDatumGetCString(datum));
+ if (HeapTupleIsValid(tuple))
+ {
+ Datum datum;
+ bool isnull;
+
+ datum = SysCacheGetAttr(RELOID, tuple,
+ Anum_pg_class_relpartbound,
+ &isnull);
+ if (!isnull)
+ boundspec = stringToNode(TextDatumGetCString(datum));
+ ReleaseSysCache(tuple);
+ }
+
+ /*
+ * The system cache may be out of date; if so, we may find no pg_class
+ * tuple or an old one where relpartbound is NULL. In that case, try
+ * the table directly. We can't just AcceptInvalidationMessages() and
+ * retry the system cache lookup because it's possible that a
+ * concurrent ATTACH PARTITION operation has removed itself from the
+ * ProcArray but has not yet added invalidation messages to the shared queue;
+ * InvalidateSystemCaches() would work, but seems excessive.
+ *
+ * Note that this algorithm assumes that the PartitionBoundSpec we manage
+ * to fetch is the right one -- so this is only good enough for
+ * concurrent ATTACH PARTITION, not concurrent DETACH PARTITION
+ * or some hypothetical operation that changes the partition bounds.
+ */
+ if (boundspec == NULL)
+ {
+ Relation pg_class;
+ SysScanDesc scan;
+ ScanKeyData key[1];
+ Datum datum;
+ bool isnull;
+
+ pg_class = table_open(RelationRelationId, AccessShareLock);
+ ScanKeyInit(&key[0],
+ Anum_pg_class_oid,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(inhrelid));
+ scan = systable_beginscan(pg_class, ClassOidIndexId, true,
+ NULL, 1, key);
+ tuple = systable_getnext(scan);
+ datum = heap_getattr(tuple, Anum_pg_class_relpartbound,
+ RelationGetDescr(pg_class), &isnull);
+ if (!isnull)
+ boundspec = stringToNode(TextDatumGetCString(datum));
+ systable_endscan(scan);
+ table_close(pg_class, AccessShareLock);
+ }
+
+ /* Sanity checks. */
+ if (!boundspec)
+ elog(ERROR, "missing relpartbound for relation %u", inhrelid);
if (!IsA(boundspec, PartitionBoundSpec))
elog(ERROR, "invalid relpartbound for relation %u", inhrelid);
/*
- * Sanity check: If the PartitionBoundSpec says this is the default
- * partition, its OID should correspond to whatever's stored in
- * pg_partitioned_table.partdefid; if not, the catalog is corrupt.
+ * If the PartitionBoundSpec says this is the default partition, its
+ * OID should match pg_partitioned_table.partdefid; if not, the
+ * catalog is corrupt.
*/
if (boundspec->is_default)
{
@@ -95,10 +153,10 @@ RelationBuildPartitionDesc(Relation rel)
inhrelid, partdefid);
}
+ /* Save results. */
oids[i] = inhrelid;
boundspecs[i] = boundspec;
++i;
- ReleaseSysCache(tuple);
}
/* Now build the actual relcache partition descriptor */
--
2.17.2 (Apple Git-113)
v4-0003-Teach-runtime-partition-pruning-to-cope-with-conc.patch
From 51b277c2bfa4cf4bc44b4b87bbb5e93e107b6cc5 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 17 Jan 2019 09:11:10 -0500
Subject: [PATCH v4 3/4] Teach runtime partition pruning to cope with
concurrent partition adds.
If new partitions were added between plan time and execution time, the
indexes stored in the subplan_map[] and subpart_map[] arrays within
the plan's PartitionedRelPruneInfo would no longer be correct. Adjust
the code to cope with added partitions. There does not seem to be
a simple way to cope with partitions that are removed, mostly because
they could then get added back again with different bounds, so don't
try to cope with that situation.
---
src/backend/executor/execPartition.c | 68 +++++++++++++++++++++++-----
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/partitioning/partprune.c | 7 ++-
src/include/nodes/plannodes.h | 1 +
6 files changed, 66 insertions(+), 13 deletions(-)
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index db133b37a5..de84d03680 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1599,18 +1599,6 @@ ExecCreatePartitionPruneState(PlanState *planstate,
int n_steps;
ListCell *lc3;
- /*
- * We must copy the subplan_map rather than pointing directly to
- * the plan's version, as we may end up making modifications to it
- * later.
- */
- pprune->subplan_map = palloc(sizeof(int) * pinfo->nparts);
- memcpy(pprune->subplan_map, pinfo->subplan_map,
- sizeof(int) * pinfo->nparts);
-
- /* We can use the subpart_map verbatim, since we never modify it */
- pprune->subpart_map = pinfo->subpart_map;
-
/* present_parts is also subject to later modification */
pprune->present_parts = bms_copy(pinfo->present_parts);
@@ -1625,6 +1613,62 @@ ExecCreatePartitionPruneState(PlanState *planstate,
partdesc = PartitionDirectoryLookup(estate->es_partition_directory,
partrel);
+ /*
+ * Initialize the subplan_map and subpart_map. Since detaching a
+ * partition requires AccessExclusiveLock, no partitions can have
+ * disappeared, nor can the bounds for any partition have changed.
+ * However, new partitions may have been added.
+ */
+ Assert(partdesc->nparts >= pinfo->nparts);
+ pprune->subplan_map = palloc(sizeof(int) * partdesc->nparts);
+ if (partdesc->nparts == pinfo->nparts)
+ {
+ /*
+ * There are no new partitions, so this is simple. We can
+ * simply point to the subpart_map from the plan, but we must
+ * copy the subplan_map since we may change it later.
+ */
+ pprune->subpart_map = pinfo->subpart_map;
+ memcpy(pprune->subplan_map, pinfo->subplan_map,
+ sizeof(int) * pinfo->nparts);
+
+ /* Double-check that list of relations has not changed. */
+ Assert(memcmp(partdesc->oids, pinfo->relid_map,
+ pinfo->nparts * sizeof(Oid)) == 0);
+ }
+ else
+ {
+ int pd_idx = 0;
+ int pp_idx;
+
+ /*
+ * Some new partitions have appeared since plan time, and
+ * those are reflected in our PartitionDesc but were not
+ * present in the one used to construct subplan_map and
+ * subpart_map. So we must construct new and longer arrays
+ * where the partitions that were originally present map to the
+ * same place, and any added indexes map to -1, as if the
+ * new partitions had been pruned.
+ */
+ pprune->subpart_map = palloc(sizeof(int) * partdesc->nparts);
+ for (pp_idx = 0; pp_idx < partdesc->nparts; ++pp_idx)
+ {
+ if (pinfo->relid_map[pd_idx] != partdesc->oids[pp_idx])
+ {
+ pprune->subplan_map[pp_idx] = -1;
+ pprune->subpart_map[pp_idx] = -1;
+ }
+ else
+ {
+ pprune->subplan_map[pp_idx] =
+ pinfo->subplan_map[pd_idx];
+ pprune->subpart_map[pp_idx] =
+ pinfo->subpart_map[pd_idx++];
+ }
+ }
+ Assert(pd_idx == pinfo->nparts);
+ }
+
n_steps = list_length(pinfo->pruning_steps);
context->strategy = partkey->strategy;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index e15724bb0e..d5fddce953 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -1197,6 +1197,7 @@ _copyPartitionedRelPruneInfo(const PartitionedRelPruneInfo *from)
COPY_SCALAR_FIELD(nexprs);
COPY_POINTER_FIELD(subplan_map, from->nparts * sizeof(int));
COPY_POINTER_FIELD(subpart_map, from->nparts * sizeof(int));
+ COPY_POINTER_FIELD(relid_map, from->nparts * sizeof(Oid));
COPY_POINTER_FIELD(hasexecparam, from->nexprs * sizeof(bool));
COPY_SCALAR_FIELD(do_initial_prune);
COPY_SCALAR_FIELD(do_exec_prune);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 65302fe65b..65b4a63013 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -947,6 +947,7 @@ _outPartitionedRelPruneInfo(StringInfo str, const PartitionedRelPruneInfo *node)
WRITE_INT_FIELD(nexprs);
WRITE_INT_ARRAY(subplan_map, node->nparts);
WRITE_INT_ARRAY(subpart_map, node->nparts);
+ WRITE_OID_ARRAY(relid_map, node->nparts);
WRITE_BOOL_ARRAY(hasexecparam, node->nexprs);
WRITE_BOOL_FIELD(do_initial_prune);
WRITE_BOOL_FIELD(do_exec_prune);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 5aa42242a9..fc60b0a7c5 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2385,6 +2385,7 @@ _readPartitionedRelPruneInfo(void)
READ_INT_FIELD(nexprs);
READ_INT_ARRAY(subplan_map, local_node->nparts);
READ_INT_ARRAY(subpart_map, local_node->nparts);
+ READ_OID_ARRAY(relid_map, local_node->nparts);
READ_BOOL_ARRAY(hasexecparam, local_node->nexprs);
READ_BOOL_FIELD(do_initial_prune);
READ_BOOL_FIELD(do_exec_prune);
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index 8c9721935d..b5c0889935 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -47,8 +47,9 @@
#include "optimizer/appendinfo.h"
#include "optimizer/optimizer.h"
#include "optimizer/pathnode.h"
-#include "partitioning/partprune.h"
+#include "parser/parsetree.h"
#include "partitioning/partbounds.h"
+#include "partitioning/partprune.h"
#include "rewrite/rewriteManip.h"
#include "utils/lsyscache.h"
@@ -359,6 +360,7 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
int partnatts = subpart->part_scheme->partnatts;
int *subplan_map;
int *subpart_map;
+ Oid *relid_map;
List *partprunequal;
List *pruning_steps;
bool contradictory;
@@ -434,6 +436,7 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
*/
subplan_map = (int *) palloc(nparts * sizeof(int));
subpart_map = (int *) palloc(nparts * sizeof(int));
+ relid_map = (Oid *) palloc(nparts * sizeof(Oid));
present_parts = NULL;
for (i = 0; i < nparts; i++)
@@ -444,6 +447,7 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
subplan_map[i] = subplanidx;
subpart_map[i] = subpartidx;
+ relid_map[i] = planner_rt_fetch(partrel->relid, root)->relid;
if (subplanidx >= 0)
{
present_parts = bms_add_member(present_parts, i);
@@ -462,6 +466,7 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
pinfo->nparts = nparts;
pinfo->subplan_map = subplan_map;
pinfo->subpart_map = subpart_map;
+ pinfo->relid_map = relid_map;
/* Determine which pruning types should be enabled at this level */
doruntimeprune |= analyze_partkey_exprs(pinfo, pruning_steps,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 6d087c268f..d66a187a53 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1108,6 +1108,7 @@ typedef struct PartitionedRelPruneInfo
int nexprs; /* Length of hasexecparam[] */
int *subplan_map; /* subplan index by partition index, or -1 */
int *subpart_map; /* subpart index by partition index, or -1 */
+ Oid *relid_map; /* relation OID by partition index, or 0 */
bool *hasexecparam; /* true if corresponding pruning_step contains
* any PARAM_EXEC Params. */
bool do_initial_prune; /* true if pruning should be performed
--
2.17.2 (Apple Git-113)
On Thu, Feb 28, 2019 at 3:27 PM Robert Haas <robertmhaas@gmail.com> wrote:
I'm not currently aware of any remaining correctness issues with this
code, although certainly there may be some. There has been a certain
dearth of volunteers to review any of this. I do plan to poke at it a
bit to see whether it has any significant performance impact, but not
today.
Today, I did some performance testing. I created a table with 100
partitions and randomly selected rows from it using pgbench, with and
without -M prepared. The results show a small regression, but I
believe it's below the noise floor. Five minute test runs.
with prepared queries
master:
tps = 10919.914458 (including connections establishing)
tps = 10876.271217 (including connections establishing)
tps = 10761.586160 (including connections establishing)
concurrent-attach:
tps = 10883.535582 (including connections establishing)
tps = 10868.471805 (including connections establishing)
tps = 10761.586160 (including connections establishing)
with simple queries
master:
tps = 1486.120530 (including connections establishing)
tps = 1486.797251 (including connections establishing)
tps = 1494.129256 (including connections establishing)
concurrent-attach:
tps = 1481.774212 (including connections establishing)
tps = 1472.159016 (including connections establishing)
tps = 1476.444097 (including connections establishing)
Looking at the total of the three results, that's about an 0.8%
regression with simple queries and an 0.2% regression with prepared
queries. Looking at the median, it's about 0.7% and 0.07%. Would
anybody like to argue that's a reason not to commit these patches?
Would anyone like to argue that there is any other reason not to
commit these patches?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, 6 Mar 2019 at 10:13, Robert Haas <robertmhaas@gmail.com> wrote:
Would anyone like to argue that there is any other reason not to
commit these patches?
Hi Robert,
Thanks for working on this. I'm sorry I didn't get a chance to
dedicate some time to look at it.
It looks like you've pushed all of this now. Can the CF entry be
marked as committed?
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Mar 14, 2019 at 6:12 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:
On Wed, 6 Mar 2019 at 10:13, Robert Haas <robertmhaas@gmail.com> wrote:
Would anyone like to argue that there is any other reason not to
commit these patches?
Hi Robert,
Thanks for working on this. I'm sorry I didn't get a chance to
dedicate some time to look at it.
It looks like you've pushed all of this now. Can the CF entry be
marked as committed?
Yeah, done now, thanks.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company