expanding inheritance in partition bound order
The current way to expand inherited tables, including partitioned tables,
is to use either find_all_inheritors() or find_inheritance_children()
depending on the context. They return child table OIDs in the (ascending)
order of those OIDs, which means the callers that need to lock the child
tables can do so without worrying about the possibility of deadlock in
some concurrent execution of that piece of code. That's good.
For partitioned tables, there is a possibility of returning child table
(partition) OIDs in the partition bound order, which in addition to
preventing the possibility of deadlocks during concurrent locking, seems
potentially useful for other caller-specific optimizations. For example,
tuple-routing code can utilize that fact to implement a binary-search-based
partition-searching algorithm. For one more example, refer to the "UPDATE
partition key" thread, where it's becoming clear that it would be nice if
the planner had put the partitions in bound order in the ModifyTable that
it creates for UPDATE of partitioned tables [1].
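To illustrate why bound order helps tuple routing, here is a minimal sketch of a binary search over range partition bounds. It is not code from the patches: route_to_partition() and the plain-integer bounds are hypothetical simplifications (the real code compares partition key datums), and the model assumes each partition is described only by an exclusive upper bound.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical, simplified model (not PostgreSQL code): partition i accepts
 * keys in [upper_bounds[i-1], upper_bounds[i]), and partition 0 has no lower
 * bound.  upper_bounds[] must be strictly ascending, that is, the partitions
 * must be listed in partition bound order.
 *
 * Returns the index of the partition that accepts key, or -1 if key is not
 * below any upper bound.
 */
int
route_to_partition(const int *upper_bounds, int nparts, int key)
{
	int			lo = 0;
	int			hi = nparts;	/* search the half-open range [lo, hi) */

	assert(upper_bounds != NULL || nparts == 0);

	while (lo < hi)
	{
		int			mid = lo + (hi - lo) / 2;

		if (key < upper_bounds[mid])
			hi = mid;			/* partition mid or an earlier one accepts key */
		else
			lo = mid + 1;		/* key lies beyond partition mid */
	}

	return (lo < nparts) ? lo : -1;
}
```

The search is only valid because the array is in bound order; with partitions listed in OID order, a linear scan over all bounds would be needed instead.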
So attached are two WIP patches:
0001 implements two interface functions:
List *get_all_partition_oids(Oid, LOCKMODE)
List *get_partition_oids(Oid, LOCKMODE)
that resemble find_all_inheritors() and find_inheritance_children(),
respectively, but expect that users call them only for partitioned tables.
Needless to say, OIDs are returned in a canonical order determined by
that of the partition bounds and the way the partition tree structure is
traversed (top-down, breadth-first, left-to-right). The patch replaces all
the calls of the old interface functions with the respective new ones for
partitioned table parents. That means expand_inherited_rtentry (among
others) now calls get_all_partition_oids() if the RTE is for a partitioned
table and find_all_inheritors() otherwise.
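The resulting order can be pictured with a toy model. The sketch below is not taken from the patches: ToyRel, MAX_RELS and toy_get_all_partition_oids() are hypothetical names, and a relation counts as partitioned here simply when it has at least one partition. It mimics what 0001's get_all_partition_oids() appears to do: emit the partitioned tables first (root included, breadth-first), then the leaf partitions in the order in which they were encountered.

```c
#include <assert.h>

#define MAX_RELS 16				/* hypothetical limit for this toy model */

typedef struct ToyRel
{
	int			oid;
	int			nparts;			/* 0 means leaf partition in this model */
	int			parts[4];		/* indexes into rels[], in bound order */
} ToyRel;

/*
 * Fill out[] with OIDs of the tree rooted at rels[root]: partitioned tables
 * first, in breadth-first order, followed by the leaf partitions.  Returns
 * the number of OIDs written.
 */
int
toy_get_all_partition_oids(const ToyRel *rels, int root, int *out)
{
	int			queue[MAX_RELS];	/* partitioned tables to expand */
	int			leaves[MAX_RELS];	/* leaf OIDs, in encounter order */
	int			qhead = 0,
				qtail = 0,
				nleaves = 0,
				n = 0,
				i;

	queue[qtail++] = root;
	while (qhead < qtail)
	{
		const ToyRel *rel = &rels[queue[qhead++]];

		out[n++] = rel->oid;
		for (i = 0; i < rel->nparts; i++)
		{
			int			child = rel->parts[i];

			assert(qtail < MAX_RELS && nleaves < MAX_RELS);
			if (rels[child].nparts > 0)
				queue[qtail++] = child;		/* expand later */
			else
				leaves[nleaves++] = rels[child].oid;
		}
	}

	/* Leaf partitions go after all the partitioned tables. */
	for (i = 0; i < nleaves; i++)
		out[n++] = leaves[i];

	return n;
}
```

Note that the output order depends only on bound order and tree shape, not on the numeric OID values, which is the property the callers would rely on.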
In its implementation, get_all_partition_oids() calls
RelationGetPartitionDispatchInfo(), which is useful to generate the result
list in the desired partition bound order. But the current interface and
implementation of RelationGetPartitionDispatchInfo() need some rework,
because they're too closely coupled with the executor's tuple-routing code.
Applying just 0001 will satisfy the requirements stated in [1], but it
won't look pretty as is for too long.
So, 0002 is a patch to refactor RelationGetPartitionDispatchInfo() and
relevant data structures. For example, PartitionDispatchData has now been
simplified to contain only the partition key, partition descriptor and
indexes array, whereas previously it also stored the relation descriptor,
partition key execution state, tuple table slot, and tuple conversion map,
which are required for tuple routing. RelationGetPartitionDispatchInfo()
no longer generates those things, but returns just enough information so
that a caller can generate and manage those things by itself. This
simplification makes it less cumbersome to call
RelationGetPartitionDispatchInfo() in other places.
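For readers unfamiliar with the indexes array, the following sketch shows how a caller might walk it, following the convention described in the existing comments: a value >= 0 is an index into the leaf partition list, and a negative value, multiplied back by -1, is an index into the array of per-partitioned-table dispatch objects. ToyDispatch and toy_route() are hypothetical stand-ins with integer range bounds, not the actual structures from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the slimmed-down PartitionDispatchData. */
typedef struct ToyDispatch
{
	int			nparts;
	const int  *upper_bounds;	/* exclusive upper bounds, in bound order */
	const int  *indexes;		/* >= 0: leaf index; < 0: -(index into pds[]) */
} ToyDispatch;

/*
 * Route key down the tree whose dispatch objects are in pds[], with the root
 * at pds[0].  Returns the index of the leaf partition in the global leaf
 * list, or -1 if no partition accepts the key.
 */
int
toy_route(const ToyDispatch *pds, int key)
{
	const ToyDispatch *pd = &pds[0];

	assert(pds != NULL);

	for (;;)
	{
		int			slot = -1;
		int			i;

		/* A linear scan keeps the sketch short; bound order allows bsearch. */
		for (i = 0; i < pd->nparts; i++)
		{
			if (key < pd->upper_bounds[i])
			{
				slot = i;
				break;
			}
		}

		if (slot < 0)
			return -1;			/* key is beyond the last bound */
		if (pd->indexes[slot] >= 0)
			return pd->indexes[slot];	/* found a leaf partition */

		/* Negative value: continue in the referenced partitioned table. */
		pd = &pds[-pd->indexes[slot]];
	}
}
```

Because the dispatch object in this form holds no relcache references or tuple slots, a caller such as this one has nothing to release afterwards, which is the point of the refactoring in 0002.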
Thanks,
Amit
[1]: /messages/by-id/CA+TgmoajC0J50=2FqnZLvB10roY+68HgFWhso=V_StkC6PWujQ@mail.gmail.com
Attachments:
0001-Add-get_all_partition_oids-and-get_partition_oids-v1.patch
From 9674053fd1e57a480d8a42585cb10421e2c76a70 Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Wed, 2 Aug 2017 17:14:59 +0900
Subject: [PATCH 1/3] Add get_all_partition_oids and get_partition_oids
Their respective counterparts find_all_inheritors() and
find_inheritance_children() read the pg_inherits catalog directly and
frame the result list in some order determined by the order of OIDs.
get_all_partition_oids() and get_partition_oids() form their result
by reading the partition OIDs from the PartitionDesc contained in the
relcache. Hence, the order of OIDs in the resulting list is based
on that of the partition bounds. In the case of get_all_partition_oids(),
which traverses the whole tree, the order is also determined by the
fact that the tree is traversed in a breadth-first manner.
---
contrib/sepgsql/dml.c | 4 +-
src/backend/catalog/partition.c | 84 ++++++++++++++++++++++
src/backend/commands/analyze.c | 8 ++-
src/backend/commands/lockcmds.c | 6 +-
src/backend/commands/publicationcmds.c | 9 ++-
src/backend/commands/tablecmds.c | 124 +++++++++++++++++++++++++--------
src/backend/commands/vacuum.c | 7 +-
src/backend/optimizer/prep/prepunion.c | 6 +-
src/include/catalog/partition.h | 3 +
9 files changed, 213 insertions(+), 38 deletions(-)
diff --git a/contrib/sepgsql/dml.c b/contrib/sepgsql/dml.c
index b643720e36..62d6610c43 100644
--- a/contrib/sepgsql/dml.c
+++ b/contrib/sepgsql/dml.c
@@ -332,8 +332,10 @@ sepgsql_dml_privileges(List *rangeTabls, bool abort_on_violation)
*/
if (!rte->inh)
tableIds = list_make1_oid(rte->relid);
- else
+ else if (rte->relkind != RELKIND_PARTITIONED_TABLE)
tableIds = find_all_inheritors(rte->relid, NoLock, NULL);
+ else
+ tableIds = get_all_partition_oids(rte->relid, NoLock);
foreach(li, tableIds)
{
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index dcc7f8af27..614b2f79f2 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1150,6 +1150,90 @@ RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
return pd;
}
+/*
+ * get_all_partition_oids - returns the list of all partitions in the
+ * partition tree rooted at relid
+ *
+ * OIDs in the list are ordered canonically using the partition bound order,
+ * while the tree is being traversed in a breadth-first manner. Actually,
+ * this is just a wrapper on top of RelationGetPartitionDispatchInfo.
+ *
+ * All the partitions are locked with lockmode. We assume that relid has been
+ * locked by the caller with lockmode.
+ */
+List *get_all_partition_oids(Oid relid, int lockmode)
+{
+ List *result = NIL;
+ List *leaf_part_oids = NIL;
+ ListCell *lc;
+ Relation rel;
+ int num_parted;
+ PartitionDispatch *pds;
+ int i;
+
+ /* caller should've locked already */
+ rel = heap_open(relid, NoLock);
+ pds = RelationGetPartitionDispatchInfo(rel, lockmode, &num_parted,
+ &leaf_part_oids);
+
+ /*
+ * First append the OIDs of all the partitions that are partitioned
+ * tables themselves, starting with relid itself.
+ */
+ result = lappend_oid(result, relid);
+ for (i = 1; i < num_parted; i++)
+ {
+ result = lappend_oid(result, RelationGetRelid(pds[i]->reldesc));
+
+ /*
+ * To avoid leaking resources, release them. This is to work around
+ * the existing interface of RelationGetPartitionDispatchInfo() that
+ * acquires these resources at the mercy of the caller to release
+ * them.
+ */
+ heap_close(pds[i]->reldesc, NoLock);
+ if (pds[i]->tupmap)
+ pfree(pds[i]->tupmap);
+ ExecDropSingleTupleTableSlot(pds[i]->tupslot);
+ }
+ heap_close(rel, NoLock);
+
+ /* Leaf partitions were not locked; do so now. */
+ foreach(lc, leaf_part_oids)
+ {
+ if (lockmode != NoLock)
+ LockRelationOid(lfirst_oid(lc), lockmode);
+ }
+
+ /* Return after concatenating the leaf partition OIDs. */
+ return list_concat(result, leaf_part_oids);
+}
+
+/*
+ * get_partition_oids - returns a list of OIDs of partitions of relid
+ *
+ * OIDs are returned from the PartitionDesc contained in the relcache, so they
+ * are ordered canonically using partition bound order.
+ */
+List *get_partition_oids(Oid relid, int lockmode)
+{
+ List *result = NIL;
+ Relation rel;
+ int i;
+ PartitionDesc partdesc;
+
+ /* caller should've locked already */
+ rel = heap_open(relid, NoLock);
+ partdesc = RelationGetPartitionDesc(rel);
+ for (i = 0; i < partdesc->nparts; i++)
+ {
+ result = lappend_oid(result, partdesc->oids[i]);
+ }
+ heap_close(rel, NoLock);
+
+ return result;
+}
+
/* Module-local functions */
/*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2b638271b3..f3c1893b12 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1281,8 +1281,12 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
* Find all members of inheritance set. We only need AccessShareLock on
* the children.
*/
- tableOIDs =
- find_all_inheritors(RelationGetRelid(onerel), AccessShareLock, NULL);
+ if (onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+ tableOIDs = find_all_inheritors(RelationGetRelid(onerel),
+ AccessShareLock, NULL);
+ else
+ tableOIDs = get_all_partition_oids(RelationGetRelid(onerel),
+ AccessShareLock);
/*
* Check that there's at least one descendant, else fail. This could
diff --git a/src/backend/commands/lockcmds.c b/src/backend/commands/lockcmds.c
index 9fe9e022b0..29a9ef82b2 100644
--- a/src/backend/commands/lockcmds.c
+++ b/src/backend/commands/lockcmds.c
@@ -15,6 +15,7 @@
#include "postgres.h"
#include "catalog/namespace.h"
+#include "catalog/partition.h"
#include "catalog/pg_inherits_fn.h"
#include "commands/lockcmds.h"
#include "miscadmin.h"
@@ -112,7 +113,10 @@ LockTableRecurse(Oid reloid, LOCKMODE lockmode, bool nowait)
List *children;
ListCell *lc;
- children = find_inheritance_children(reloid, NoLock);
+ if (get_rel_relkind(reloid) != RELKIND_PARTITIONED_TABLE)
+ children = find_inheritance_children(reloid, NoLock);
+ else
+ children = get_partition_oids(reloid, NoLock);
foreach(lc, children)
{
diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index 610cb499d2..ab7423577f 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -515,8 +515,13 @@ OpenTableList(List *tables)
ListCell *child;
List *children;
- children = find_all_inheritors(myrelid, ShareUpdateExclusiveLock,
- NULL);
+ if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+ children = find_all_inheritors(myrelid,
+ ShareUpdateExclusiveLock,
+ NULL);
+ else
+ children = get_all_partition_oids(myrelid,
+ ShareUpdateExclusiveLock);
foreach(child, children)
{
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 7859ef13ac..332697c095 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1231,7 +1231,12 @@ ExecuteTruncate(TruncateStmt *stmt)
ListCell *child;
List *children;
- children = find_all_inheritors(myrelid, AccessExclusiveLock, NULL);
+ if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+ children = find_all_inheritors(myrelid,
+ AccessExclusiveLock, NULL);
+ else
+ children = get_all_partition_oids(myrelid,
+ AccessExclusiveLock);
foreach(child, children)
{
@@ -2555,8 +2560,11 @@ renameatt_internal(Oid myrelid,
* calls to renameatt() can determine whether there are any parents
* outside the inheritance hierarchy being processed.
*/
- child_oids = find_all_inheritors(myrelid, AccessExclusiveLock,
- &child_numparents);
+ if (targetrelation->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+ child_oids = find_all_inheritors(myrelid, AccessExclusiveLock,
+ &child_numparents);
+ else
+ child_oids = get_all_partition_oids(myrelid, AccessExclusiveLock);
/*
* find_all_inheritors does the recursive search of the inheritance
@@ -2581,6 +2589,10 @@ renameatt_internal(Oid myrelid,
* tables; else the rename would put them out of step.
*
* expected_parents will only be 0 if we are not already recursing.
+ *
+ * We don't bother to distinguish between find_inheritance_children's
+ * and get_partition_oids's results unlike in most other places,
+ * because we're not concerned about the order of OIDs here.
*/
if (expected_parents == 0 &&
find_inheritance_children(myrelid, NoLock) != NIL)
@@ -2765,8 +2777,13 @@ rename_constraint_internal(Oid myrelid,
ListCell *lo,
*li;
- child_oids = find_all_inheritors(myrelid, AccessExclusiveLock,
- &child_numparents);
+ Assert(targetrelation != NULL);
+ if (targetrelation->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+ child_oids = find_all_inheritors(myrelid, AccessExclusiveLock,
+ &child_numparents);
+ else
+ child_oids = get_all_partition_oids(myrelid,
+ AccessExclusiveLock);
forboth(lo, child_oids, li, child_numparents)
{
@@ -2781,6 +2798,12 @@ rename_constraint_internal(Oid myrelid,
}
else
{
+ /*
+ * We don't bother to distinguish between
+ * find_inheritance_children's and get_partition_oids's results
+ * unlike in most other places, because we're not concerned about
+ * the order of OIDs here.
+ */
if (expected_parents == 0 &&
find_inheritance_children(myrelid, NoLock) != NIL)
ereport(ERROR,
@@ -4790,7 +4813,10 @@ ATSimpleRecursion(List **wqueue, Relation rel,
ListCell *child;
List *children;
- children = find_all_inheritors(relid, lockmode, NULL);
+ if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+ children = find_all_inheritors(relid, lockmode, NULL);
+ else
+ children = get_all_partition_oids(relid, lockmode);
/*
* find_all_inheritors does the recursive search of the inheritance
@@ -5183,6 +5209,10 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel,
/*
* Cannot add identity column if table has children, because identity does
* not inherit. (Adding column and identity separately will work.)
+ *
+ * We don't bother to distinguish between find_inheritance_children's and
+ * get_partition_oids's results unlike in most other places, because we're
+ * not concerned about the order of OIDs here.
*/
if (colDef->identity &&
recurse &&
@@ -5390,9 +5420,12 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel,
/*
* Propagate to children as appropriate. Unlike most other ALTER
* routines, we have to do this one level of recursion at a time; we can't
- * use find_all_inheritors to do it in one pass.
+ * use find_all_inheritors or get_all_partition_oids to do it in one pass.
*/
- children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+ if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+ children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+ else
+ children = get_partition_oids(RelationGetRelid(rel), lockmode);
/*
* If we are told not to recurse, there had better not be any child
@@ -6509,9 +6542,12 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
/*
* Propagate to children as appropriate. Unlike most other ALTER
* routines, we have to do this one level of recursion at a time; we can't
- * use find_all_inheritors to do it in one pass.
+ * use find_all_inheritors or get_all_partition_oids to do it in one pass.
*/
- children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+ if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+ children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+ else
+ children = get_partition_oids(RelationGetRelid(rel), lockmode);
if (children)
{
@@ -6943,9 +6979,12 @@ ATAddCheckConstraint(List **wqueue, AlteredTableInfo *tab, Relation rel,
/*
* Propagate to children as appropriate. Unlike most other ALTER
* routines, we have to do this one level of recursion at a time; we can't
- * use find_all_inheritors to do it in one pass.
+ * use find_all_inheritors or get_all_partition_oids to do it in one pass.
*/
- children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+ if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+ children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+ else
+ children = get_partition_oids(RelationGetRelid(rel), lockmode);
/*
* Check if ONLY was specified with ALTER TABLE. If so, allow the
@@ -7663,8 +7702,14 @@ ATExecValidateConstraint(Relation rel, char *constrName, bool recurse,
* shouldn't try to look for it in the children.
*/
if (!recursing && !con->connoinherit)
- children = find_all_inheritors(RelationGetRelid(rel),
- lockmode, NULL);
+ {
+ if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+ children = find_all_inheritors(RelationGetRelid(rel),
+ lockmode, NULL);
+ else
+ children = get_all_partition_oids(RelationGetRelid(rel),
+ lockmode);
+ }
/*
* For CHECK constraints, we must ensure that we only mark the
@@ -8544,12 +8589,14 @@ ATExecDropConstraint(Relation rel, const char *constrName,
/*
* Propagate to children as appropriate. Unlike most other ALTER
* routines, we have to do this one level of recursion at a time; we can't
- * use find_all_inheritors to do it in one pass.
+ * use find_all_inheritors or get_all_partition_oids to do it in one pass.
*/
- if (!is_no_inherit_constraint)
+ if (is_no_inherit_constraint)
+ children = NIL;
+ else if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
children = find_inheritance_children(RelationGetRelid(rel), lockmode);
else
- children = NIL;
+ children = get_partition_oids(RelationGetRelid(rel), lockmode);
/*
* For a partitioned table, if partitions exist and we are told not to
@@ -8836,7 +8883,10 @@ ATPrepAlterColumnType(List **wqueue,
ListCell *child;
List *children;
- children = find_all_inheritors(relid, lockmode, NULL);
+ if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+ children = find_all_inheritors(relid, lockmode, NULL);
+ else
+ children = get_all_partition_oids(relid, lockmode);
/*
* find_all_inheritors does the recursive search of the inheritance
@@ -8886,6 +8936,11 @@ ATPrepAlterColumnType(List **wqueue,
relation_close(childrel, NoLock);
}
}
+ /*
+ * We don't bother to distinguish between find_inheritance_children's and
+ * get_partition_oids's results unlike in most other places, because we're
+ * not concerned about the order of OIDs here.
+ */
else if (!recursing &&
find_inheritance_children(RelationGetRelid(rel), NoLock) != NIL)
ereport(ERROR,
@@ -10996,6 +11051,7 @@ ATExecAddInherit(Relation child_rel, RangeVar *parent, LOCKMODE lockmode)
*
* We use weakest lock we can on child's children, namely AccessShareLock.
*/
+ Assert(child_rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE);
children = find_all_inheritors(RelationGetRelid(child_rel),
AccessShareLock, NULL);
@@ -13421,7 +13477,7 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
{
Relation attachrel,
catalog;
- List *attachrel_children;
+ List *attachrel_children = NIL;
TupleConstr *attachrel_constr;
List *partConstraint,
*existConstraint;
@@ -13501,15 +13557,20 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
* table, nor its partitions. But we cannot risk a deadlock by taking a
* weaker lock now and the stronger one only when needed.
*/
- attachrel_children = find_all_inheritors(RelationGetRelid(attachrel),
- AccessExclusiveLock, NULL);
- if (list_member_oid(attachrel_children, RelationGetRelid(rel)))
- ereport(ERROR,
- (errcode(ERRCODE_DUPLICATE_TABLE),
- errmsg("circular inheritance not allowed"),
- errdetail("\"%s\" is already a child of \"%s\".",
- RelationGetRelationName(rel),
- RelationGetRelationName(attachrel))));
+ if (attachrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+ {
+ Oid attachrel_oid = RelationGetRelid(attachrel);
+
+ attachrel_children = get_all_partition_oids(attachrel_oid,
+ AccessExclusiveLock);
+ if (list_member_oid(attachrel_children, RelationGetRelid(rel)))
+ ereport(ERROR,
+ (errcode(ERRCODE_DUPLICATE_TABLE),
+ errmsg("circular inheritance not allowed"),
+ errdetail("\"%s\" is already a child of \"%s\".",
+ RelationGetRelationName(rel),
+ RelationGetRelationName(attachrel))));
+ }
/* Temp parent cannot have a partition that is itself not a temp */
if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP &&
@@ -13707,6 +13768,13 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
/* Constraints proved insufficient, so we need to scan the table. */
ListCell *lc;
+ /*
+ * If attachrel isn't partitioned, attachrel_children would be empty.
+ * We still need to process attachrel itself, so initialize.
+ */
+ if (attachrel_children == NIL)
+ attachrel_children = list_make1_oid(RelationGetRelid(attachrel));
+
foreach(lc, attachrel_children)
{
AlteredTableInfo *tab;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index faa181207a..7bea95d9c5 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -31,6 +31,7 @@
#include "access/transam.h"
#include "access/xact.h"
#include "catalog/namespace.h"
+#include "catalog/partition.h"
#include "catalog/pg_database.h"
#include "catalog/pg_inherits_fn.h"
#include "catalog/pg_namespace.h"
@@ -423,14 +424,14 @@ get_rel_oids(Oid relid, const RangeVar *vacrel)
/*
* Make relation list entries for this guy and its partitions, if any.
- * Note that the list returned by find_all_inheritors() include the
- * passed-in OID at its head. Also note that we did not request a
+ * Note that the list returned by get_all_partition_oids() includes
+ * the passed-in OID at its head. Also note that we did not request a
* lock to be taken to match what would be done otherwise.
*/
oldcontext = MemoryContextSwitchTo(vac_context);
if (include_parts)
oid_list = list_concat(oid_list,
- find_all_inheritors(relid, NoLock, NULL));
+ get_all_partition_oids(relid, NoLock));
else
oid_list = lappend_oid(oid_list, relid);
MemoryContextSwitchTo(oldcontext);
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index cf46b74782..398bdd598a 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -33,6 +33,7 @@
#include "access/heapam.h"
#include "access/htup_details.h"
#include "access/sysattr.h"
+#include "catalog/partition.h"
#include "catalog/pg_inherits_fn.h"
#include "catalog/pg_type.h"
#include "miscadmin.h"
@@ -1418,7 +1419,10 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
lockmode = AccessShareLock;
/* Scan for all members of inheritance set, acquire needed locks */
- inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);
+ if (rte->relkind != RELKIND_PARTITIONED_TABLE)
+ inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);
+ else
+ inhOIDs = get_all_partition_oids(parentOID, lockmode);
/*
* Check that there's at least one descendant, else treat as no-child
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 434ded37d7..e6314fbaa2 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -85,6 +85,9 @@ extern List *map_partition_varattnos(List *expr, int target_varno,
extern List *RelationGetPartitionQual(Relation rel);
extern Expr *get_partition_qual_relid(Oid relid);
+extern List *get_all_partition_oids(Oid relid, int lockmode);
+extern List *get_partition_oids(Oid relid, int lockmode);
+
/* For tuple routing */
extern PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
int lockmode, int *num_parted,
--
2.11.0
0002-Decouple-RelationGetPartitionDispatchInfo-from-execu-v1.patch
From f869287c25397a39a50acadd34e5e1677e3ce858 Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Mon, 24 Jul 2017 18:59:57 +0900
Subject: [PATCH 2/3] Decouple RelationGetPartitionDispatchInfo() from executor
Currently it and the structure it generates viz. PartitionDispatch
objects are too coupled with the executor's tuple-routing code. In
particular, it's pretty undesirable that it makes it the responsibility
of the caller to release some resources, such as relcache references
and tuple table slots. That makes it harder to use in places other
than where it's currently being used.
After this refactoring, ExecSetupPartitionTupleRouting() now needs to
do some of the work that was previously done in
RelationGetPartitionDispatchInfo() and get_all_partition_oids() no
longer needs to do some things that it used to.
---
src/backend/catalog/partition.c | 367 +++++++++++++++++----------------
src/backend/commands/copy.c | 35 ++--
src/backend/executor/execMain.c | 158 ++++++++++++--
src/backend/executor/nodeModifyTable.c | 29 ++-
src/include/catalog/partition.h | 52 ++---
src/include/executor/executor.h | 4 +-
src/include/nodes/execnodes.h | 53 ++++-
7 files changed, 426 insertions(+), 272 deletions(-)
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 614b2f79f2..2a6ad70719 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -105,6 +105,24 @@ typedef struct PartitionRangeBound
bool lower; /* this is the lower (vs upper) bound */
} PartitionRangeBound;
+/*-----------------------
+ * PartitionDispatchData - information of partitions of one partitioned table
+ * in a partition tree
+ *
+ * partkey Partition key of the table
+ * partdesc Partition descriptor of the table
+ * indexes Array with partdesc->nparts members (for details on what the
+ * individual value represents, see the comments in
+ * RelationGetPartitionDispatchInfo())
+ *-----------------------
+ */
+typedef struct PartitionDispatchData
+{
+ PartitionKey partkey; /* Points into the table's relcache entry */
+ PartitionDesc partdesc; /* Ditto */
+ int *indexes;
+} PartitionDispatchData;
+
static int32 qsort_partition_list_value_cmp(const void *a, const void *b,
void *arg);
static int32 qsort_partition_rbound_cmp(const void *a, const void *b,
@@ -976,178 +994,167 @@ get_partition_qual_relid(Oid relid)
}
/*
- * Append OIDs of rel's partitions to the list 'partoids' and for each OID,
- * append pointer rel to the list 'parents'.
- */
-#define APPEND_REL_PARTITION_OIDS(rel, partoids, parents) \
- do\
- {\
- int i;\
- for (i = 0; i < (rel)->rd_partdesc->nparts; i++)\
- {\
- (partoids) = lappend_oid((partoids), (rel)->rd_partdesc->oids[i]);\
- (parents) = lappend((parents), (rel));\
- }\
- } while(0)
-
-/*
* RelationGetPartitionDispatchInfo
- * Returns information necessary to route tuples down a partition tree
+ * Returns necessary information for each partition in the partition
+ * tree rooted at rel
*
- * All the partitions will be locked with lockmode, unless it is NoLock.
- * A list of the OIDs of all the leaf partitions of rel is returned in
- * *leaf_part_oids.
+ * Information returned includes the following: *ptinfos contains a list of
+ * PartitionedTableInfo objects, one for each partitioned table (with at least
+ * one member, that is, one for the root partitioned table), *leaf_part_oids
+ * contains a list of the OIDs of all the leaf partitions.
+ *
+ * Note that we lock only those partitions that are partitioned tables, because
+ * we need to look at their relcache entries to get their PartitionKey and
+ * PartitionDesc. It's the caller's responsibility to lock the leaf partitions
+ * that will actually be accessed during a given query.
*/
-PartitionDispatch *
+void
RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
- int *num_parted, List **leaf_part_oids)
+ List **ptinfos, List **leaf_part_oids)
{
- PartitionDispatchData **pd;
- List *all_parts = NIL,
- *all_parents = NIL,
- *parted_rels,
- *parted_rel_parents;
+ List *all_parts,
+ *all_parents;
ListCell *lc1,
*lc2;
int i,
- k,
offset;
/*
- * Lock partitions and make a list of the partitioned ones to prepare
- * their PartitionDispatch objects below.
+ * We rely on the relcache to traverse the partition tree, building
+ * both the leaf partition OIDs list and the PartitionedTableInfo list.
+ * Starting with the root partitioned table for which we already have the
+ * relcache entry, we look at its partition descriptor to get the
+ * partition OIDs. For partitions that are themselves partitioned tables,
+ * we get their relcache entries after locking them with lockmode and
+ * queue their partitions to be looked at later. Leaf partitions are
+ * added to the result list without locking. For each partitioned table,
+ * we build a PartitionedTableInfo object and add it to the other result
+ * list.
*
- * Cannot use find_all_inheritors() here, because then the order of OIDs
- * in parted_rels list would be unknown, which does not help, because we
- * assign indexes within individual PartitionDispatch in an order that is
- * predetermined (determined by the order of OIDs in individual partition
- * descriptors).
+ * Since RelationBuildPartitionDescriptor() puts partitions in a canonical
+ * order determined by comparing partition bounds, we can rely on
+ * concurrent backends seeing the partitions in the same order, ensuring that
+ * there are no deadlocks when locking the partitions.
*/
- *num_parted = 1;
- parted_rels = list_make1(rel);
- /* Root partitioned table has no parent, so NULL for parent */
- parted_rel_parents = list_make1(NULL);
- APPEND_REL_PARTITION_OIDS(rel, all_parts, all_parents);
+ i = offset = 0;
+ *ptinfos = *leaf_part_oids = NIL;
+
+ /* Start with the root table. */
+ all_parts = list_make1_oid(RelationGetRelid(rel));
+ all_parents = list_make1_oid(InvalidOid);
forboth(lc1, all_parts, lc2, all_parents)
{
- Relation partrel = heap_open(lfirst_oid(lc1), lockmode);
- Relation parent = lfirst(lc2);
- PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
+ Oid partrelid = lfirst_oid(lc1);
+ Oid parentrelid = lfirst_oid(lc2);
- /*
- * If this partition is a partitioned table, add its children to the
- * end of the list, so that they are processed as well.
- */
- if (partdesc)
+ if (get_rel_relkind(partrelid) == RELKIND_PARTITIONED_TABLE)
{
- (*num_parted)++;
- parted_rels = lappend(parted_rels, partrel);
- parted_rel_parents = lappend(parted_rel_parents, parent);
- APPEND_REL_PARTITION_OIDS(partrel, all_parts, all_parents);
- }
- else
- heap_close(partrel, NoLock);
+ int j,
+ k;
+ Relation partrel;
+ PartitionKey partkey;
+ PartitionDesc partdesc;
+ PartitionedTableInfo *ptinfo;
+ PartitionDispatch pd;
+
+ if (partrelid != RelationGetRelid(rel))
+ partrel = heap_open(partrelid, lockmode);
+ else
+ partrel = rel;
- /*
- * We keep the partitioned ones open until we're done using the
- * information being collected here (for example, see
- * ExecEndModifyTable).
- */
- }
+ partkey = RelationGetPartitionKey(partrel);
+ partdesc = RelationGetPartitionDesc(partrel);
+
+ ptinfo = (PartitionedTableInfo *)
+ palloc0(sizeof(PartitionedTableInfo));
+ ptinfo->relid = partrelid;
+ ptinfo->parentid = parentrelid;
+
+ ptinfo->pd = pd = (PartitionDispatchData *)
+ palloc0(sizeof(PartitionDispatchData));
+ pd->partkey = partkey;
- /*
- * We want to create two arrays - one for leaf partitions and another for
- * partitioned tables (including the root table and internal partitions).
- * While we only create the latter here, leaf partition array of suitable
- * objects (such as, ResultRelInfo) is created by the caller using the
- * list of OIDs we return. Indexes into these arrays get assigned in a
- * breadth-first manner, whereby partitions of any given level are placed
- * consecutively in the respective arrays.
- */
- pd = (PartitionDispatchData **) palloc(*num_parted *
- sizeof(PartitionDispatchData *));
- *leaf_part_oids = NIL;
- i = k = offset = 0;
- forboth(lc1, parted_rels, lc2, parted_rel_parents)
- {
- Relation partrel = lfirst(lc1);
- Relation parent = lfirst(lc2);
- PartitionKey partkey = RelationGetPartitionKey(partrel);
- TupleDesc tupdesc = RelationGetDescr(partrel);
- PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
- int j,
- m;
-
- pd[i] = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
- pd[i]->reldesc = partrel;
- pd[i]->key = partkey;
- pd[i]->keystate = NIL;
- pd[i]->partdesc = partdesc;
- if (parent != NULL)
- {
/*
- * For every partitioned table other than root, we must store a
- * tuple table slot initialized with its tuple descriptor and a
- * tuple conversion map to convert a tuple from its parent's
- * rowtype to its own. That is to make sure that we are looking at
- * the correct row using the correct tuple descriptor when
- * computing its partition key for tuple routing.
- */
- pd[i]->tupslot = MakeSingleTupleTableSlot(tupdesc);
- pd[i]->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
- tupdesc,
- gettext_noop("could not convert row type"));
- }
- else
- {
- /* Not required for the root partitioned table */
- pd[i]->tupslot = NULL;
- pd[i]->tupmap = NULL;
- }
- pd[i]->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
+ * Pin the partition descriptor before stashing the references to the
+ * information contained in it into this PartitionDispatch object.
+ *
+ PinPartitionDesc(partdesc);*/
+ pd->partdesc = partdesc;
- /*
- * Indexes corresponding to the internal partitions are multiplied by
- * -1 to distinguish them from those of leaf partitions. Encountering
- * an index >= 0 means we found a leaf partition, which is immediately
- * returned as the partition we are looking for. A negative index
- * means we found a partitioned table, whose PartitionDispatch object
- * is located at the above index multiplied back by -1. Using the
- * PartitionDispatch object, search is continued further down the
- * partition tree.
- */
- m = 0;
- for (j = 0; j < partdesc->nparts; j++)
- {
- Oid partrelid = partdesc->oids[j];
+ /*
+ * The values contained in the following array correspond to
+ * indexes of this table's partitions in the global sequence of
+ * all the partitions contained in the partition tree rooted at
+ * rel, traversed in a breadth-first manner. The values should be
+ * such that we will be able to distinguish the leaf partitions
+ * from the non-leaf partitions, because they are returned to
+ * the caller in separate structures from where they will be
+ * accessed. The way that's done is described below:
+ *
+ * Leaf partition OIDs are put into the global leaf_part_oids list,
+ * and for each one, the value stored is its ordinal position in
+ * the list minus 1.
+ *
+ * PartitionedTableInfo objects corresponding to partitions that
+ * are partitioned tables are put into the global ptinfos[] list,
+ * and for each one, the value stored is its zero-based position
+ * in the list multiplied by -1 (the root is at position zero).
+ *
+ * So while looking at the values in the indexes array, if one
+ * gets zero or a positive value, then it's a leaf partition;
+ * otherwise, it's a partitioned table.
+ */
+ pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
- if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
- {
- *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
- pd[i]->indexes[j] = k++;
- }
- else
+ k = 0;
+ for (j = 0; j < partdesc->nparts; j++)
{
+ Oid partrelid = partdesc->oids[j];
+
/*
- * offset denotes the number of partitioned tables of upper
- * levels including those of the current level. Any partition
- * of this table must belong to the next level and hence will
- * be placed after the last partitioned table of this level.
+ * Queue this partition so that it will be processed later
+ * by the outer loop.
*/
- pd[i]->indexes[j] = -(1 + offset + m);
- m++;
+ all_parts = lappend_oid(all_parts, partrelid);
+ all_parents = lappend_oid(all_parents,
+ RelationGetRelid(partrel));
+
+ if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
+ {
+ *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
+ pd->indexes[j] = i++;
+ }
+ else
+ {
+ /*
+ * offset counts the partitioned tables, other than the
+ * root, that have already been assigned an index. k
+ * counts the partitions of this table that were found
+ * to be partitioned tables.
+ */
+ pd->indexes[j] = -(1 + offset + k);
+ k++;
+ }
}
- }
- i++;
- /*
- * This counts the number of partitioned tables at upper levels
- * including those of the current level.
- */
- offset += m;
+ offset += k;
+
+ /*
+ * Release the relation descriptor. The lock that we have on the
+ * table will keep the PartitionDesc that is pointing into
+ * RelationData intact, a pointer to which we hope to keep
+ * through this transaction's commit.
+ * (XXX - how true is that?)
+ */
+ if (partrel != rel)
+ heap_close(partrel, NoLock);
+
+ *ptinfos = lappend(*ptinfos, ptinfo);
+ }
}
- return pd;
+ Assert(i == list_length(*leaf_part_oids));
+ Assert((offset + 1) == list_length(*ptinfos));
}
/*
@@ -1164,45 +1171,38 @@ RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
List *get_all_partition_oids(Oid relid, int lockmode)
{
List *result = NIL;
+ List *ptinfo = NIL;
List *leaf_part_oids = NIL;
ListCell *lc;
- Relation rel;
- int num_parted;
- PartitionDispatch *pds;
- int i;
+ Relation rel;
/* caller should've locked already */
rel = heap_open(relid, NoLock);
- pds = RelationGetPartitionDispatchInfo(rel, lockmode, &num_parted,
- &leaf_part_oids);
+
+ /*
+ * Get the information about the partition tree. All the partitioned
+ * tables in the tree are locked, but not the leaf partitions, which
+ * we lock below.
+ */
+ RelationGetPartitionDispatchInfo(rel, lockmode, &ptinfo, &leaf_part_oids);
+ heap_close(rel, NoLock);
/*
* First append the OIDs of all the partitions that are partitioned
- * tables themselves, starting with relid itself.
+ * tables themselves.
*/
- result = lappend_oid(result, relid);
- for (i = 1; i < num_parted; i++)
+ foreach (lc, ptinfo)
{
- result = lappend_oid(result, RelationGetRelid(pds[i]->reldesc));
+ PartitionedTableInfo *ptinfo = lfirst(lc);
- /*
- * To avoid leaking resources, release them. This is to work around
- * the existing interface of RelationGetPartitionDispatchInfo() that
- * acquires these resources at the mercy of the caller to release
- * them.
- */
- heap_close(pds[i]->reldesc, NoLock);
- if (pds[i]->tupmap)
- pfree(pds[i]->tupmap);
- ExecDropSingleTupleTableSlot(pds[i]->tupslot);
+ result = lappend_oid(result, ptinfo->relid);
}
- heap_close(rel, NoLock);
- /* Leaf partitions were not locked; do so now. */
- foreach(lc, leaf_part_oids)
+ /* Lock leaf partitions, if requested. */
+ foreach (lc, leaf_part_oids)
{
if (lockmode != NoLock)
- LockRelationOid(lfirst_oid(lc), lockmode);
+ LockRelationOid(lfirst_oid(lc), lockmode);
}
/* Return after concatening the leaf partition OIDs. */
@@ -1948,7 +1948,7 @@ generate_partition_qual(Relation rel)
* ----------------
*/
void
-FormPartitionKeyDatum(PartitionDispatch pd,
+FormPartitionKeyDatum(PartitionKeyInfo *keyinfo,
TupleTableSlot *slot,
EState *estate,
Datum *values,
@@ -1957,20 +1957,21 @@ FormPartitionKeyDatum(PartitionDispatch pd,
ListCell *partexpr_item;
int i;
- if (pd->key->partexprs != NIL && pd->keystate == NIL)
+ if (keyinfo->key->partexprs != NIL && keyinfo->keystate == NIL)
{
/* Check caller has set up context correctly */
Assert(estate != NULL &&
GetPerTupleExprContext(estate)->ecxt_scantuple == slot);
/* First time through, set up expression evaluation state */
- pd->keystate = ExecPrepareExprList(pd->key->partexprs, estate);
+ keyinfo->keystate = ExecPrepareExprList(keyinfo->key->partexprs,
+ estate);
}
- partexpr_item = list_head(pd->keystate);
- for (i = 0; i < pd->key->partnatts; i++)
+ partexpr_item = list_head(keyinfo->keystate);
+ for (i = 0; i < keyinfo->key->partnatts; i++)
{
- AttrNumber keycol = pd->key->partattrs[i];
+ AttrNumber keycol = keyinfo->key->partattrs[i];
Datum datum;
bool isNull;
@@ -2007,13 +2008,13 @@ FormPartitionKeyDatum(PartitionDispatch pd,
* the latter case.
*/
int
-get_partition_for_tuple(PartitionDispatch *pd,
+get_partition_for_tuple(PartitionTupleRoutingInfo **ptrinfos,
TupleTableSlot *slot,
EState *estate,
- PartitionDispatchData **failed_at,
+ PartitionTupleRoutingInfo **failed_at,
TupleTableSlot **failed_slot)
{
- PartitionDispatch parent;
+ PartitionTupleRoutingInfo *parent;
Datum values[PARTITION_MAX_KEYS];
bool isnull[PARTITION_MAX_KEYS];
int cur_offset,
@@ -2024,11 +2025,11 @@ get_partition_for_tuple(PartitionDispatch *pd,
TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
/* start with the root partitioned table */
- parent = pd[0];
+ parent = ptrinfos[0];
while (true)
{
- PartitionKey key = parent->key;
- PartitionDesc partdesc = parent->partdesc;
+ PartitionKey key = parent->pd->partkey;
+ PartitionDesc partdesc = parent->pd->partdesc;
TupleTableSlot *myslot = parent->tupslot;
TupleConversionMap *map = parent->tupmap;
@@ -2060,7 +2061,7 @@ get_partition_for_tuple(PartitionDispatch *pd,
* So update ecxt_scantuple accordingly.
*/
ecxt->ecxt_scantuple = slot;
- FormPartitionKeyDatum(parent, slot, estate, values, isnull);
+ FormPartitionKeyDatum(parent->keyinfo, slot, estate, values, isnull);
if (key->strategy == PARTITION_STRATEGY_RANGE)
{
@@ -2131,13 +2132,13 @@ get_partition_for_tuple(PartitionDispatch *pd,
*failed_slot = slot;
break;
}
- else if (parent->indexes[cur_index] >= 0)
+ else if (parent->pd->indexes[cur_index] >= 0)
{
- result = parent->indexes[cur_index];
+ result = parent->pd->indexes[cur_index];
break;
}
else
- parent = pd[-parent->indexes[cur_index]];
+ parent = ptrinfos[-parent->pd->indexes[cur_index]];
}
error_exit:
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 53e296559a..b3de3de454 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -165,8 +165,8 @@ typedef struct CopyStateData
bool volatile_defexprs; /* is any of defexprs volatile? */
List *range_table;
- PartitionDispatch *partition_dispatch_info;
- int num_dispatch; /* Number of entries in the above array */
+ PartitionTupleRoutingInfo **ptrinfos;
+ int num_parted; /* Number of entries in the above array */
int num_partitions; /* Number of members in the following arrays */
ResultRelInfo *partitions; /* Per partition result relation */
TupleConversionMap **partition_tupconv_maps;
@@ -1425,7 +1425,7 @@ BeginCopy(ParseState *pstate,
/* Initialize state for CopyFrom tuple routing. */
if (is_from && rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- PartitionDispatch *partition_dispatch_info;
+ PartitionTupleRoutingInfo **ptrinfos;
ResultRelInfo *partitions;
TupleConversionMap **partition_tupconv_maps;
TupleTableSlot *partition_tuple_slot;
@@ -1434,13 +1434,13 @@ BeginCopy(ParseState *pstate,
ExecSetupPartitionTupleRouting(rel,
1,
- &partition_dispatch_info,
+ &ptrinfos,
&partitions,
&partition_tupconv_maps,
&partition_tuple_slot,
&num_parted, &num_partitions);
- cstate->partition_dispatch_info = partition_dispatch_info;
- cstate->num_dispatch = num_parted;
+ cstate->ptrinfos = ptrinfos;
+ cstate->num_parted = num_parted;
cstate->partitions = partitions;
cstate->num_partitions = num_partitions;
cstate->partition_tupconv_maps = partition_tupconv_maps;
@@ -2495,7 +2495,7 @@ CopyFrom(CopyState cstate)
if ((resultRelInfo->ri_TrigDesc != NULL &&
(resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
- cstate->partition_dispatch_info != NULL ||
+ cstate->ptrinfos != NULL ||
cstate->volatile_defexprs)
{
useHeapMultiInsert = false;
@@ -2573,7 +2573,7 @@ CopyFrom(CopyState cstate)
ExecStoreTuple(tuple, slot, InvalidBuffer, false);
/* Determine the partition to heap_insert the tuple into */
- if (cstate->partition_dispatch_info)
+ if (cstate->ptrinfos)
{
int leaf_part_index;
TupleConversionMap *map;
@@ -2587,7 +2587,7 @@ CopyFrom(CopyState cstate)
* partition, respectively.
*/
leaf_part_index = ExecFindPartition(resultRelInfo,
- cstate->partition_dispatch_info,
+ cstate->ptrinfos,
slot,
estate);
Assert(leaf_part_index >= 0 &&
@@ -2818,23 +2818,20 @@ CopyFrom(CopyState cstate)
ExecCloseIndices(resultRelInfo);
- /* Close all the partitioned tables, leaf partitions, and their indices */
- if (cstate->partition_dispatch_info)
+ /* Close all the leaf partitions and their indices */
+ if (cstate->ptrinfos)
{
int i;
/*
- * Remember cstate->partition_dispatch_info[0] corresponds to the root
- * partitioned table, which we must not try to close, because it is
- * the main target table of COPY that will be closed eventually by
- * DoCopy(). Also, tupslot is NULL for the root partitioned table.
+ * cstate->ptrinfos[0] corresponds to the root partitioned table, for
+ * which we didn't create tupslot.
*/
- for (i = 1; i < cstate->num_dispatch; i++)
+ for (i = 1; i < cstate->num_parted; i++)
{
- PartitionDispatch pd = cstate->partition_dispatch_info[i];
+ PartitionTupleRoutingInfo *ptrinfo = cstate->ptrinfos[i];
- heap_close(pd->reldesc, NoLock);
- ExecDropSingleTupleTableSlot(pd->tupslot);
+ ExecDropSingleTupleTableSlot(ptrinfo->tupslot);
}
for (i = 0; i < cstate->num_partitions; i++)
{
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index c11aa4fe21..0379e489d9 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -3214,8 +3214,8 @@ EvalPlanQualEnd(EPQState *epqstate)
* tuple routing for partitioned tables
*
* Output arguments:
- * 'pd' receives an array of PartitionDispatch objects with one entry for
- * every partitioned table in the partition tree
+ * 'ptrinfos' receives an array of PartitionTupleRoutingInfo objects with one
+ * entry for each partitioned table in the partition tree
* 'partitions' receives an array of ResultRelInfo objects with one entry for
* every leaf partition in the partition tree
* 'tup_conv_maps' receives an array of TupleConversionMap objects with one
@@ -3237,7 +3237,7 @@ EvalPlanQualEnd(EPQState *epqstate)
void
ExecSetupPartitionTupleRouting(Relation rel,
Index resultRTindex,
- PartitionDispatch **pd,
+ PartitionTupleRoutingInfo ***ptrinfos,
ResultRelInfo **partitions,
TupleConversionMap ***tup_conv_maps,
TupleTableSlot **partition_tuple_slot,
@@ -3245,13 +3245,135 @@ ExecSetupPartitionTupleRouting(Relation rel,
{
TupleDesc tupDesc = RelationGetDescr(rel);
List *leaf_parts;
+ List *ptinfos = NIL;
ListCell *cell;
int i;
ResultRelInfo *leaf_part_rri;
+ Relation parent;
- /* Get the tuple-routing information and lock partitions */
- *pd = RelationGetPartitionDispatchInfo(rel, RowExclusiveLock, num_parted,
- &leaf_parts);
+ /*
+ * Get information about the partition tree. All the partitioned
+ * tables in the tree are locked, but not the leaf partitions. We
+ * lock them while building their ResultRelInfos below.
+ */
+ RelationGetPartitionDispatchInfo(rel, RowExclusiveLock,
+ &ptinfos, &leaf_parts);
+
+ /*
+ * The ptinfos list contains PartitionedTableInfo objects for all the
+ * partitioned tables in the partition tree. Using the information
+ * therein, we construct an array of PartitionTupleRoutingInfo objects
+ * to be used during tuple-routing.
+ */
+ *num_parted = list_length(ptinfos);
+ *ptrinfos = (PartitionTupleRoutingInfo **) palloc0(*num_parted *
+ sizeof(PartitionTupleRoutingInfo *));
+ /*
+ * Free the ptinfos List structure itself as we go through (open-coded
+ * list_free).
+ */
+ i = 0;
+ cell = list_head(ptinfos);
+ parent = NULL;
+ while (cell)
+ {
+ ListCell *tmp = cell;
+ PartitionedTableInfo *ptinfo = lfirst(tmp),
+ *next_ptinfo;
+ Relation partrel;
+ PartitionTupleRoutingInfo *ptrinfo;
+
+ if (lnext(tmp))
+ next_ptinfo = lfirst(lnext(tmp));
+
+ /* As mentioned above, the partitioned tables have been locked. */
+ if (ptinfo->relid != RelationGetRelid(rel))
+ partrel = heap_open(ptinfo->relid, NoLock);
+ else
+ partrel = rel;
+
+ ptrinfo = (PartitionTupleRoutingInfo *)
+ palloc0(sizeof(PartitionTupleRoutingInfo));
+ ptrinfo->relid = ptinfo->relid;
+
+ /* Stash a reference to this PartitionDispatch. */
+ ptrinfo->pd = ptinfo->pd;
+
+ /* State for extracting partition key from tuples will go here. */
+ ptrinfo->keyinfo = (PartitionKeyInfo *)
+ palloc0(sizeof(PartitionKeyInfo));
+ ptrinfo->keyinfo->key = RelationGetPartitionKey(partrel);
+ ptrinfo->keyinfo->keystate = NIL;
+
+ /*
+ * For every partitioned table other than root, we must store a tuple
+ * table slot initialized with its tuple descriptor and a tuple
+ * conversion map to convert a tuple from its parent's rowtype to its
+ * own. That is to make sure that we are looking at the correct row
+ * using the correct tuple descriptor when computing its partition key
+ * for tuple routing.
+ */
+ if (ptinfo->parentid != InvalidOid)
+ {
+ TupleDesc tupdesc = RelationGetDescr(partrel);
+
+ /* Open the parent relation descriptor if not already done. */
+ if (ptinfo->parentid == RelationGetRelid(rel))
+ {
+ parent = rel;
+ }
+ else if (parent == NULL)
+ {
+ /* Locked by RelationGetPartitionDispatchInfo(). */
+ parent = heap_open(ptinfo->parentid, NoLock);
+ }
+
+ ptrinfo->tupslot = MakeSingleTupleTableSlot(tupdesc);
+ ptrinfo->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
+ tupdesc,
+ gettext_noop("could not convert row type"));
+
+ /*
+ * Close the parent descriptor, if the next partitioned table in
+ * the list is not a sibling, because it will have a different
+ * parent if so.
+ */
+ if (parent && parent != rel && (lnext(tmp) == NULL ||
+ next_ptinfo->parentid != ptinfo->parentid))
+ {
+ heap_close(parent, NoLock);
+ parent = NULL;
+ }
+
+ /*
+ * Release the relation descriptor. The lock that we have on the
+ * table will keep the PartitionDesc that is pointing into
+ * RelationData intact, a pointer to which we hope to keep
+ * through this transaction's commit.
+ * (XXX - how true is that?)
+ */
+ if (partrel != rel)
+ heap_close(partrel, NoLock);
+ }
+ else
+ {
+ /* Not required for the root partitioned table */
+ ptrinfo->tupslot = NULL;
+ ptrinfo->tupmap = NULL;
+ }
+
+ (*ptrinfos)[i++] = ptrinfo;
+
+ /* Free the ListCell. */
+ cell = lnext(cell);
+ pfree(tmp);
+ }
+
+ /* Free the List itself. */
+ if (ptinfos)
+ pfree(ptinfos);
+
+ /* For leaf partitions, we build ResultRelInfos and TupleConversionMaps. */
*num_partitions = list_length(leaf_parts);
*partitions = (ResultRelInfo *) palloc(*num_partitions *
sizeof(ResultRelInfo));
@@ -3274,11 +3396,11 @@ ExecSetupPartitionTupleRouting(Relation rel,
TupleDesc part_tupdesc;
/*
- * We locked all the partitions above including the leaf partitions.
- * Note that each of the relations in *partitions are eventually
- * closed by the caller.
+ * RelationGetPartitionDispatchInfo didn't lock the leaf partitions,
+ * so lock here. Note that each of the relations in *partitions are
+ * eventually closed (when the plan is shut down, for instance).
*/
- partrel = heap_open(lfirst_oid(cell), NoLock);
+ partrel = heap_open(lfirst_oid(cell), RowExclusiveLock);
part_tupdesc = RelationGetDescr(partrel);
/*
@@ -3291,7 +3413,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
* partition from the parent's type to the partition's.
*/
(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
- gettext_noop("could not convert row type"));
+ gettext_noop("could not convert row type"));
InitResultRelInfo(leaf_part_rri,
partrel,
@@ -3325,11 +3447,13 @@ ExecSetupPartitionTupleRouting(Relation rel,
* by get_partition_for_tuple() unchanged.
*/
int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
- TupleTableSlot *slot, EState *estate)
+ExecFindPartition(ResultRelInfo *resultRelInfo,
+ PartitionTupleRoutingInfo **ptrinfos,
+ TupleTableSlot *slot,
+ EState *estate)
{
int result;
- PartitionDispatchData *failed_at;
+ PartitionTupleRoutingInfo *failed_at;
TupleTableSlot *failed_slot;
/*
@@ -3339,7 +3463,7 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
if (resultRelInfo->ri_PartitionCheck)
ExecPartitionCheck(resultRelInfo, slot, estate);
- result = get_partition_for_tuple(pd, slot, estate,
+ result = get_partition_for_tuple(ptrinfos, slot, estate,
&failed_at, &failed_slot);
if (result < 0)
{
@@ -3349,9 +3473,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
char *val_desc;
ExprContext *ecxt = GetPerTupleExprContext(estate);
- failed_rel = failed_at->reldesc;
+ failed_rel = heap_open(failed_at->relid, NoLock);
ecxt->ecxt_scantuple = failed_slot;
- FormPartitionKeyDatum(failed_at, failed_slot, estate,
+ FormPartitionKeyDatum(failed_at->keyinfo, failed_slot, estate,
key_values, key_isnull);
val_desc = ExecBuildSlotPartitionKeyDescription(failed_rel,
key_values,
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 30add8e3c7..00cbee4fb6 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -277,7 +277,7 @@ ExecInsert(ModifyTableState *mtstate,
resultRelInfo = estate->es_result_relation_info;
/* Determine the partition to heap_insert the tuple into */
- if (mtstate->mt_partition_dispatch_info)
+ if (mtstate->mt_ptrinfos)
{
int leaf_part_index;
TupleConversionMap *map;
@@ -291,7 +291,7 @@ ExecInsert(ModifyTableState *mtstate,
* respectively.
*/
leaf_part_index = ExecFindPartition(resultRelInfo,
- mtstate->mt_partition_dispatch_info,
+ mtstate->mt_ptrinfos,
slot,
estate);
Assert(leaf_part_index >= 0 &&
@@ -1486,7 +1486,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
int numResultRelInfos;
/* Find the set of partitions so that we can find their TupleDescs. */
- if (mtstate->mt_partition_dispatch_info != NULL)
+ if (mtstate->mt_ptrinfos != NULL)
{
/*
* For INSERT via partitioned table, so we need TupleDescs based
@@ -1910,7 +1910,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
if (operation == CMD_INSERT &&
rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- PartitionDispatch *partition_dispatch_info;
+ PartitionTupleRoutingInfo **ptrinfos;
ResultRelInfo *partitions;
TupleConversionMap **partition_tupconv_maps;
TupleTableSlot *partition_tuple_slot;
@@ -1919,13 +1919,13 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
ExecSetupPartitionTupleRouting(rel,
node->nominalRelation,
- &partition_dispatch_info,
+ &ptrinfos,
&partitions,
&partition_tupconv_maps,
&partition_tuple_slot,
&num_parted, &num_partitions);
- mtstate->mt_partition_dispatch_info = partition_dispatch_info;
- mtstate->mt_num_dispatch = num_parted;
+ mtstate->mt_ptrinfos = ptrinfos;
+ mtstate->mt_num_parted = num_parted;
mtstate->mt_partitions = partitions;
mtstate->mt_num_partitions = num_partitions;
mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
@@ -2335,19 +2335,16 @@ ExecEndModifyTable(ModifyTableState *node)
}
/*
- * Close all the partitioned tables, leaf partitions, and their indices
+ * Close all the leaf partitions and their indices.
*
- * Remember node->mt_partition_dispatch_info[0] corresponds to the root
- * partitioned table, which we must not try to close, because it is the
- * main target table of the query that will be closed by ExecEndPlan().
- * Also, tupslot is NULL for the root partitioned table.
+ * node->mt_ptrinfos[0] corresponds to the root partitioned
+ * table, for which we didn't create tupslot.
*/
- for (i = 1; i < node->mt_num_dispatch; i++)
+ for (i = 1; i < node->mt_num_parted; i++)
{
- PartitionDispatch pd = node->mt_partition_dispatch_info[i];
+ PartitionTupleRoutingInfo *ptrinfo = node->mt_ptrinfos[i];
- heap_close(pd->reldesc, NoLock);
- ExecDropSingleTupleTableSlot(pd->tupslot);
+ ExecDropSingleTupleTableSlot(ptrinfo->tupslot);
}
for (i = 0; i < node->mt_num_partitions; i++)
{
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index e6314fbaa2..98dcd246b4 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -39,36 +39,23 @@ typedef struct PartitionDescData
typedef struct PartitionDescData *PartitionDesc;
-/*-----------------------
- * PartitionDispatch - information about one partitioned table in a partition
- * hierarchy required to route a tuple to one of its partitions
- *
- * reldesc Relation descriptor of the table
- * key Partition key information of the table
- * keystate Execution state required for expressions in the partition key
- * partdesc Partition descriptor of the table
- * tupslot A standalone TupleTableSlot initialized with this table's tuple
- * descriptor
- * tupmap TupleConversionMap to convert from the parent's rowtype to
- * this table's rowtype (when extracting the partition key of a
- * tuple just before routing it through this table)
- * indexes Array with partdesc->nparts members (for details on what
- * individual members represent, see how they are set in
- * RelationGetPartitionDispatchInfo())
- *-----------------------
+typedef struct PartitionDispatchData *PartitionDispatch;
+
+/*
+ * Information about one partitioned table in a given partition tree
*/
-typedef struct PartitionDispatchData
+typedef struct PartitionedTableInfo
{
- Relation reldesc;
- PartitionKey key;
- List *keystate; /* list of ExprState */
- PartitionDesc partdesc;
- TupleTableSlot *tupslot;
- TupleConversionMap *tupmap;
- int *indexes;
-} PartitionDispatchData;
+ Oid relid;
+ Oid parentid;
-typedef struct PartitionDispatchData *PartitionDispatch;
+ /*
+ * This contains information about bounds of the partitions of this
+ * table and about where individual partitions are placed in the global
+ * partition tree.
+ */
+ PartitionDispatch pd;
+} PartitionedTableInfo;
extern void RelationBuildPartitionDesc(Relation relation);
extern bool partition_bounds_equal(PartitionKey key,
@@ -85,21 +72,20 @@ extern List *map_partition_varattnos(List *expr, int target_varno,
extern List *RelationGetPartitionQual(Relation rel);
extern Expr *get_partition_qual_relid(Oid relid);
+extern void RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
+ List **ptinfos, List **leaf_part_oids);
extern List *get_all_partition_oids(Oid relid, int lockmode);
extern List *get_partition_oids(Oid relid, int lockmode);
/* For tuple routing */
-extern PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
- int lockmode, int *num_parted,
- List **leaf_part_oids);
-extern void FormPartitionKeyDatum(PartitionDispatch pd,
+extern void FormPartitionKeyDatum(PartitionKeyInfo *keyinfo,
TupleTableSlot *slot,
EState *estate,
Datum *values,
bool *isnull);
-extern int get_partition_for_tuple(PartitionDispatch *pd,
+extern int get_partition_for_tuple(PartitionTupleRoutingInfo **pd,
TupleTableSlot *slot,
EState *estate,
- PartitionDispatchData **failed_at,
+ PartitionTupleRoutingInfo **failed_at,
TupleTableSlot **failed_slot);
#endif /* PARTITION_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 60326f9d03..6e1d3a6d2f 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -208,13 +208,13 @@ extern void EvalPlanQualSetTuple(EPQState *epqstate, Index rti,
extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti);
extern void ExecSetupPartitionTupleRouting(Relation rel,
Index resultRTindex,
- PartitionDispatch **pd,
+ PartitionTupleRoutingInfo ***ptrinfos,
ResultRelInfo **partitions,
TupleConversionMap ***tup_conv_maps,
TupleTableSlot **partition_tuple_slot,
int *num_parted, int *num_partitions);
extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
- PartitionDispatch *pd,
+ PartitionTupleRoutingInfo **ptrinfos,
TupleTableSlot *slot,
EState *estate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 35c28a6143..1514d62f52 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -414,6 +414,55 @@ typedef struct ResultRelInfo
Relation ri_PartitionRoot;
} ResultRelInfo;
+/* Forward declarations, to avoid including other headers */
+typedef struct PartitionKeyData *PartitionKey;
+typedef struct PartitionDispatchData *PartitionDispatch;
+
+/*
+ * PartitionKeyInfo - execution state for the partition key of a
+ * partitioned table
+ *
+ * keystate is the execution state required for expressions contained in the
+ * partition key. It is NIL until initialized by FormPartitionKeyDatum() if
+ * and when it is called; for example, during tuple routing through a given
+ * partitioned table.
+ */
+typedef struct PartitionKeyInfo
+{
+ PartitionKey key; /* Points into the table's relcache entry */
+ List *keystate;
+} PartitionKeyInfo;
+
+/*
+ * PartitionTupleRoutingInfo - information required for tuple-routing
+ * through one partitioned table in a partition
+ * tree
+ */
+typedef struct PartitionTupleRoutingInfo
+{
+ /* OID of the table */
+ Oid relid;
+
+ /* Information about the table's partitions */
+ PartitionDispatch pd;
+
+ /* See comment above the definition of PartitionKeyInfo */
+ PartitionKeyInfo *keyinfo;
+
+ /*
+ * A standalone TupleTableSlot initialized with this table's tuple
+ * descriptor
+ */
+ TupleTableSlot *tupslot;
+
+ /*
+ * TupleConversionMap to convert from the parent's rowtype to this table's
+ * rowtype (when extracting the partition key of a tuple just before
+ * routing it through this table)
+ */
+ TupleConversionMap *tupmap;
+} PartitionTupleRoutingInfo;
+
/* ----------------
* EState information
*
@@ -970,9 +1019,9 @@ typedef struct ModifyTableState
TupleTableSlot *mt_existing; /* slot to store existing target tuple in */
List *mt_excludedtlist; /* the excluded pseudo relation's tlist */
TupleTableSlot *mt_conflproj; /* CONFLICT ... SET ... projection target */
- struct PartitionDispatchData **mt_partition_dispatch_info;
/* Tuple-routing support info */
- int mt_num_dispatch; /* Number of entries in the above array */
+ struct PartitionTupleRoutingInfo **mt_ptrinfos;
+ int mt_num_parted; /* Number of entries in the above array */
int mt_num_partitions; /* Number of members in the following
* arrays */
ResultRelInfo *mt_partitions; /* Per partition result relation */
--
2.11.0
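For readers following the indexes[] comment in the patch above, here is a
minimal standalone sketch of that encoding: leaf partitions get values >= 0,
partitioned tables get the negated position of their entry in the
breadth-first list, and lookup follows negative values down the tree.
ToyTable, assign_indexes() and toy_route() are hypothetical names invented
for this illustration; they are not part of the patch or of PostgreSQL, and
locking, bounds and tuple conversion are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PARTS  4
#define MAX_TABLES 8

/* Toy stand-in for one partitioned table (hypothetical type). */
typedef struct ToyTable
{
    int              nparts;
    struct ToyTable *child[MAX_PARTS];   /* NULL means a leaf partition */
    int              indexes[MAX_PARTS]; /* filled in by assign_indexes() */
} ToyTable;

/*
 * Walk the tree breadth-first.  parted[] receives the partitioned tables in
 * the order they are reached, with the root at position 0.  Returns the
 * number of partitioned tables and sets *nleaves to the number of leaves.
 */
static int
assign_indexes(ToyTable *root, ToyTable **parted, int *nleaves)
{
    int nparted = 1;
    int next_leaf = 0;
    int cur;

    parted[0] = root;
    for (cur = 0; cur < nparted; cur++)
    {
        ToyTable *t = parted[cur];
        int       j;

        for (j = 0; j < t->nparts; j++)
        {
            if (t->child[j] == NULL)
                t->indexes[j] = next_leaf++;    /* leaf: value >= 0 */
            else
            {
                assert(nparted < MAX_TABLES);
                t->indexes[j] = -nparted;       /* non-leaf: value < 0 */
                parted[nparted++] = t->child[j];
            }
        }
    }
    *nleaves = next_leaf;
    return nparted;
}

/*
 * Follow a sequence of per-level partition choices from the root until a
 * leaf is reached; returns that leaf's index.
 */
static int
toy_route(ToyTable **parted, const int *choice)
{
    ToyTable *t = parted[0];

    for (;;)
    {
        int idx = t->indexes[*choice++];

        if (idx >= 0)
            return idx;
        t = parted[-idx];
    }
}
```

With a root that has partitions (leaf, partitioned table A, leaf) and A
having two leaf partitions, the root's array is filled with 0, -1, 1 and
A's with 2, 3.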
On Fri, Aug 4, 2017 at 1:08 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
The current way to expand inherited tables, including partitioned tables,
is to use either find_all_inheritors() or find_inheritance_children()
depending on the context. They return child table OIDs in the (ascending)
order of those OIDs, which means the callers that need to lock the child
tables can do so without worrying about the possibility of deadlock in
some concurrent execution of that piece of code. That's good.
For partitioned tables, there is a possibility of returning child table
(partition) OIDs in the partition bound order, which in addition to
preventing the possibility of deadlocks during concurrent locking, seems
potentially useful for other caller-specific optimizations. For example,
tuple-routing code can utilize that fact to implement binary-search based
partition-searching algorithm. For one more example, refer to the "UPDATE
partition key" thread where it's becoming clear that it would be nice if
the planner had put the partitions in bound order in the ModifyTable that
it creates for UPDATE of partitioned tables [1].
Thanks a lot for working on this. Partition-wise join can benefit from
this as well. See comment about build_simple_rel's matching algorithm
in [1]. It will become O(n) instead of O(n^2).
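To illustrate why a shared bound order helps, here is a rough sketch of the
general idea only; it is not build_simple_rel's actual code. It assumes the
expanded children arrive in the same relative order as the partition
descriptor's OIDs, possibly with some partitions missing, so one forward
pass pairs them up instead of searching the child list once per partition.
match_in_bound_order() and its parameters are hypothetical names.

```c
#include <assert.h>

typedef unsigned int Oid;

/*
 * For each partition in bound_order[], store in pos[] the position of the
 * same OID in children[], or -1 if that partition is absent.
 *
 * Assumption: children[] is a subsequence of bound_order[], that is, it
 * lists OIDs in the same relative order.  Each step then does O(1) work,
 * so the whole match is O(nparts).
 */
static void
match_in_bound_order(const Oid *bound_order, int nparts,
                     const Oid *children, int nchildren,
                     int *pos)
{
    int i;
    int j = 0;

    assert(nchildren <= nparts);
    for (i = 0; i < nparts; i++)
    {
        if (j < nchildren && children[j] == bound_order[i])
            pos[i] = j++;
        else
            pos[i] = -1;
    }
}
```

If the children came back in OID order instead, each partition would need
its own scan of children[], which is where the O(n^2) comes from.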
So attached are two WIP patches:
0001 implements two interface functions:
List *get_all_partition_oids(Oid, LOCKMODE)
List *get_partition_oids(Oid, LOCKMODE)
that resemble find_all_inheritors() and find_inheritance_children(),
respectively, but expect that users call them only for partitioned tables.
Needless to mention, OIDs are returned with canonical order determined by
that of the partition bounds and the way partition tree structure is
traversed (top-down, breadth-first-left-to-right). Patch replaces all the
calls of the old interface functions with the respective new ones for
partitioned table parents. That means expand_inherited_rtentry (among
others) now calls get_all_partition_oids() if the RTE is for a partitioned
table and find_all_inheritors() otherwise.
In its implementation, get_all_partition_oids() calls
RelationGetPartitionDispatchInfo(), which is useful to generate the result
list in the desired partition bound order. But the current interface and
implementation of RelationGetPartitionDispatchInfo() needs some rework,
because it's too closely coupled with the executor's tuple routing code.
Maybe we want to implement get_all_partition_oids() by calling
get_partition_oids() on each new entry that gets added, similar to
find_all_inheritors(). That might avoid changes to
RelationGetPartitionDispatchInfo() and also the dependency on that structure.
Also, almost every place that called find_all_inheritors() or
find_inheritance_children() has been changed to an if/else that calls
either those functions or the new ones, as required. Maybe we should
create macros/functions to do that routing, to keep the code readable.
Maybe find_all_inheritors() and find_inheritance_children()
themselves become the routing functions, and their original code moves
into new functions get_all_inheritors() and
get_inheritance_children(). We may choose other names for the functions.
The idea is to create routing functions/macros instead of sprinkling
the code with if/else conditions.
[1]: /messages/by-id/CA+TgmobeRUTu4osXA_UA4AORho83WxAvFG8n1NQcoFuujbeh7A@mail.gmail.com
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Aug 4, 2017 at 3:38 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
The current way to expand inherited tables, including partitioned tables,
is to use either find_all_inheritors() or find_inheritance_children()
depending on the context. They return child table OIDs in the (ascending)
order of those OIDs, which means the callers that need to lock the child
tables can do so without worrying about the possibility of deadlock in
some concurrent execution of that piece of code. That's good.
For partitioned tables, there is a possibility of returning child table
(partition) OIDs in the partition bound order, which in addition to
preventing the possibility of deadlocks during concurrent locking, seems
potentially useful for other caller-specific optimizations. For example,
tuple-routing code can utilize that fact to implement binary-search based
partition-searching algorithm. For one more example, refer to the "UPDATE
partition key" thread where it's becoming clear that it would be nice if
the planner had put the partitions in bound order in the ModifyTable that
it creates for UPDATE of partitioned tables [1].
I guess I don't really understand why we want to change the locking
order. That is bound to make expanding the inheritance hierarchy more
expensive. If we use this approach in all cases, it seems to me we're
bound to reintroduce the problem we fixed in commit
c1e0e7e1d790bf18c913e6a452dea811e858b554 and maybe add a few more in
the same vein. But I don't see that there's any necessary relation
between the order of locking and the order of expansion inside the
relcache entry/plan/whatever else -- so I propose that we keep the
existing locking order and only change the other stuff.
While reading related code this morning, I noticed that
RelationBuildPartitionDesc and RelationGetPartitionDispatchInfo have
*already* changed the locking order for certain operations, because
the PartitionDesc's OID array is bound-ordered not OID-ordered. That
means that when RelationGetPartitionDispatchInfo uses the
PartitionDesc's OID array to figure out what to lock, it's potentially
going to encounter partitions in a different order than would have
been the case if it had used find_all_inheritors directly. I'm
tempted to think that RelationGetPartitionDispatchInfo shouldn't
really be doing locking at all. The best way to have the locking
always happen in the same order is to have only one piece of code that
determines that order - and I vote for find_all_inheritors. Aside
from the fact that it's the way we've always done it (and still do it
in most other places), that code includes infinite-loop defenses which
the new code you've introduced lacks.
Concretely, my proposal is:
1. Before calling RelationGetPartitionDispatchInfo, the calling code
should use find_all_inheritors to lock all the relevant relations (or
the planner could use find_all_inheritors to get a list of relation
OIDs, store it in the plan in order, and then at execution time we
visit them in that order and lock them).
2. RelationGetPartitionDispatchInfo assumes the relations are already locked.
3. While we're optimizing, in the first loop inside of
RelationGetPartitionDispatchInfo, don't call heap_open(). Instead,
use get_rel_relkind() to see whether we've got a partitioned table; if
so, open it. If not, there's no need.
4. For safety, add a function bool RelationLockHeldByMe(Oid) and add
to this loop a check if (!RelationLockHeldByMe(lfirst_oid(lc1)))
elog(ERROR, ...). Might be interesting to stuff that check into the
relation_open(..., NoLock) path, too.
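To make points 3 and 4 concrete, here is a minimal standalone C sketch. It does not use PostgreSQL headers: ToyRel and count_rels_to_open() are hypothetical names, the is_partitioned field stands in for what get_rel_relkind() would report, and lock_held stands in for what the proposed RelationLockHeldByMe() would report. The -1 return takes the place of elog(ERROR, ...).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int Oid;       /* stand-in for PostgreSQL's Oid */

/* Hypothetical per-relation facts the loop would look up. */
typedef struct ToyRel
{
    Oid  relid;
    bool is_partitioned;        /* assumed result of get_rel_relkind() */
    bool lock_held;             /* assumed result of RelationLockHeldByMe() */
} ToyRel;

/*
 * Walk rels[] the way the first loop might under the proposal.
 * Returns how many relations would be opened (partitioned tables only),
 * or -1 as soon as a relation is found that the caller has not locked.
 */
static int
count_rels_to_open(const ToyRel *rels, size_t nrels)
{
    int     nopened = 0;
    size_t  i;

    assert(rels != NULL || nrels == 0);

    for (i = 0; i < nrels; i++)
    {
        if (!rels[i].lock_held)
            return -1;          /* real code would elog(ERROR, ...) */
        if (rels[i].is_partitioned)
            nopened++;          /* leaf partitions are never opened here */
    }
    return nopened;
}
```

With this shape the loop itself never takes a lock, so the lock order is decided entirely by whoever called find_all_inheritors beforehand.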
One objection to this line of attack is that there might be a good
case for locking only the partitioned inheritors first and then going
back and locking the leaf nodes in a second pass, or even only when
required for a particular row. However, that doesn't require putting
everything in bound order - it only requires moving the partitioned
children to the beginning of the list. And I think rather than having
new logic for that, we should teach find_inheritance_children() to do
that directly. I have a feeling Ashutosh is going to cringe at this
suggestion, but my idea is to do this by denormalizing: add a column
to pg_inherits indicating whether the child is of
RELKIND_PARTITIONED_TABLE. Then, when find_inheritance_children scans
pg_inherits, it can pull that flag out for free along with the
relation OID, and qsort() first by the flag and then by the OID. It
can also return the number of initial elements of its return value
which have that flag set.
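As a rough illustration of that sort, here is a standalone C sketch using the standard qsort(). ChildEntry, child_entry_cmp() and sort_children() are hypothetical names, and the is_partitioned field represents the proposed (not existing) pg_inherits flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef unsigned int Oid;       /* stand-in for PostgreSQL's Oid */

/* Hypothetical pair read from one pg_inherits row. */
typedef struct ChildEntry
{
    Oid  relid;                 /* child relation OID */
    bool is_partitioned;        /* the proposed denormalized flag */
} ChildEntry;

/* Partitioned children sort first; ties are broken by ascending OID. */
static int
child_entry_cmp(const void *a, const void *b)
{
    const ChildEntry *ca = (const ChildEntry *) a;
    const ChildEntry *cb = (const ChildEntry *) b;

    if (ca->is_partitioned != cb->is_partitioned)
        return ca->is_partitioned ? -1 : 1;
    if (ca->relid < cb->relid)
        return -1;
    if (ca->relid > cb->relid)
        return 1;
    return 0;
}

/*
 * Sort children[] in place and return the number of leading entries
 * that are partitioned tables.
 */
static size_t
sort_children(ChildEntry *children, size_t nchildren)
{
    size_t npartitioned = 0;

    assert(children != NULL || nchildren == 0);

    if (nchildren > 0)
        qsort(children, nchildren, sizeof(ChildEntry), child_entry_cmp);
    while (npartitioned < nchildren && children[npartitioned].is_partitioned)
        npartitioned++;
    return npartitioned;
}
```

The returned count is what lets a caller stop after the partitioned prefix without looking at any leaf.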
Then, in find_all_inheritors, we split rels_list into
partitioned_rels_list and other_rels_list, and process
partitioned_rels_list in its entirety before touching other_rels_list;
they get concatenated at the end.
Now, find_all_inheritors and find_inheritance_children can also grow a
flag bool only_partitioned_children; if set, then we skip the
unpartitioned children entirely.
With all that in place, you can call find_all_inheritors(blah blah,
false) to lock the whole hierarchy, or find_all_inheritors(blah blah,
true) to lock just the partitioned tables in the hierarchy. You get a
consistent lock order either way, and if you start with only the
partitioned tables and later want the leaf partitions too, you just go
through the partitioned children in the order they were returned and
find_inheritance_children(blah blah, false) on each one of them and
the lock order is exactly consistent with what you would have gotten
if you'd done find_all_inheritors(blah blah, false) originally.
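The consistency claim can be checked on a toy catalog. The following standalone C sketch is only a model of the idea: InhRow, find_children() and find_all() are hypothetical names, the catalog rows are assumed to be pre-sorted by child OID (so no qsort() is needed), the root is assumed to be a partitioned table, and the duplicate and infinite-loop defenses of the real find_all_inheritors are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int Oid;       /* stand-in for PostgreSQL's Oid */

#define MAX_RELS 32             /* toy capacity of every output array */

/* Toy stand-in for a pg_inherits row plus the proposed flag column. */
typedef struct InhRow
{
    Oid  parent;
    Oid  child;
    bool child_is_partitioned;
} InhRow;

/*
 * Collect the children of 'parent' into out[]: partitioned children
 * first, then (unless only_partitioned) the others.  Because rows[] is
 * assumed sorted by child OID, each group comes out in OID order.
 */
static size_t
find_children(const InhRow *rows, size_t nrows, Oid parent,
              bool only_partitioned, Oid *out, size_t *npartitioned)
{
    size_t n = 0;
    size_t i;

    for (i = 0; i < nrows; i++)
        if (rows[i].parent == parent && rows[i].child_is_partitioned)
        {
            assert(n < MAX_RELS);
            out[n++] = rows[i].child;
        }
    *npartitioned = n;

    if (!only_partitioned)
        for (i = 0; i < nrows; i++)
            if (rows[i].parent == parent && !rows[i].child_is_partitioned)
            {
                assert(n < MAX_RELS);
                out[n++] = rows[i].child;
            }
    return n;
}

/*
 * Expand the hierarchy under 'root'.  All partitioned tables are
 * processed before any leaf is emitted; the two lists are concatenated
 * at the end.
 */
static size_t
find_all(const InhRow *rows, size_t nrows, Oid root,
         bool only_partitioned, Oid *out)
{
    Oid     parts[MAX_RELS];
    Oid     leaves[MAX_RELS];
    Oid     kids[MAX_RELS];
    size_t  nparts = 0, nleaves = 0, next = 0, n = 0, i;

    parts[nparts++] = root;
    while (next < nparts)
    {
        size_t nkp;
        size_t nk = find_children(rows, nrows, parts[next++],
                                  only_partitioned, kids, &nkp);

        for (i = 0; i < nkp; i++)
        {
            assert(nparts < MAX_RELS);
            parts[nparts++] = kids[i];
        }
        for (i = nkp; i < nk; i++)
        {
            assert(nleaves < MAX_RELS);
            leaves[nleaves++] = kids[i];
        }
    }

    assert(nparts + nleaves <= MAX_RELS);
    for (i = 0; i < nparts; i++)
        out[n++] = parts[i];
    for (i = 0; i < nleaves; i++)
        out[n++] = leaves[i];
    return n;
}
```

In this model, the result with only_partitioned set is a prefix of the full result, and visiting that prefix in order while calling find_children() on each member yields the leaves in the same order as the full expansion.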
Thoughts?
P.S. While I haven't reviewed 0002 in detail, I think the concept of
minimizing what needs to be built in RelationGetPartitionDispatchInfo
is a very good idea.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 4 August 2017 at 22:55, Robert Haas <robertmhaas@gmail.com> wrote:
1. Before calling RelationGetPartitionDispatchInfo, the calling code
should use find_all_inheritors to lock all the relevant relations (or
the planner could use find_all_inheritors to get a list of relation
OIDs, store it in the plan in order, and then at execution time we
visit them in that order and lock them).
2. RelationGetPartitionDispatchInfo assumes the relations are already locked.
I agree. I think overall, we should keep
RelationGetPartitionDispatchInfo() only for preparing the dispatch
info in the planner, and generate the locked oids (using
find_all_inheritors() or get_partitioned_oids() or whatever) *without*
using RelationGetPartitionDispatchInfo(), since
RelationGetPartitionDispatchInfo() is generating the pd structure
which we don't want in every expansion.
3. While we're optimizing, in the first loop inside of
RelationGetPartitionDispatchInfo, don't call heap_open(). Instead,
use get_rel_relkind() to see whether we've got a partitioned table; if
so, open it. If not, there's no need.
Yes, this way we need to open only the partitioned tables.
P.S. While I haven't reviewed 0002 in detail, I think the concept of
minimizing what needs to be built in RelationGetPartitionDispatchInfo
is a very good idea.
True. I also think RelationGetPartitionDispatchInfo() should be
called while preparing the ModifyTable plan; the PartitionDispatch
data structure returned by RelationGetPartitionDispatchInfo() should
be stored in that plan, and then the execution-time fields in
PartitionDispatch would be populated in ExecInitModifyTable().
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On 2017/08/04 20:28, Ashutosh Bapat wrote:
On Fri, Aug 4, 2017 at 1:08 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
The current way to expand inherited tables, including partitioned tables,
is to use either find_all_inheritors() or find_inheritance_children()
depending on the context. They return child table OIDs in the (ascending)
order of those OIDs, which means the callers that need to lock the child
tables can do so without worrying about the possibility of deadlock in
some concurrent execution of that piece of code. That's good.
For partitioned tables, there is a possibility of returning child table
(partition) OIDs in the partition bound order, which in addition to
preventing the possibility of deadlocks during concurrent locking, seems
potentially useful for other caller-specific optimizations. For example,
tuple-routing code can utilize that fact to implement binary-search based
partition-searching algorithm. For one more example, refer to the "UPDATE
partition key" thread where it's becoming clear that it would be nice if
the planner had put the partitions in bound order in the ModifyTable that
it creates for UPDATE of partitioned tables [1].
Thanks a lot for working on this. Partition-wise join can benefit from
this as well. See comment about build_simple_rel's matching algorithm
in [1]. It will become O(n) instead of O(n^2).
Nice. It seems that we have a good demand for $subject. :)
So attached are two WIP patches:
0001 implements two interface functions:
List *get_all_partition_oids(Oid, LOCKMODE)
List *get_partition_oids(Oid, LOCKMODE)
that resemble find_all_inheritors() and find_inheritance_children(),
respectively, but expect that users call them only for partitioned tables.
Needless to mention, OIDs are returned with canonical order determined by
that of the partition bounds and the way the partition tree structure is
traversed (top-down, breadth-first-left-to-right). Patch replaces all the
calls of the old interface functions with the respective new ones for
partitioned table parents. That means expand_inherited_rtentry (among
others) now calls get_all_partition_oids() if the RTE is for a partitioned
table and find_all_inheritors() otherwise.
In its implementation, get_all_partition_oids() calls
RelationGetPartitionDispatchInfo(), which is useful to generate the result
list in the desired partition bound order. But the current interface and
implementation of RelationGetPartitionDispatchInfo() needs some rework,
because it's too closely coupled with the executor's tuple routing code.
Maybe we want to implement get_all_partition_oids() by calling
get_partition_oids() on each new entry that gets added, similar to
find_all_inheritors(). That might avoid changes to
RelationGetPartitionDispatchInfo() and also the dependency on that structure.
Also, almost every place that called find_all_inheritors() or
find_inheritance_children() is changed to an if () else that calls
either those functions or the new ones as required. Maybe we should
create macros/functions to do that routing to keep the code readable.
Maybe find_all_inheritors() and find_inheritance_children()
themselves become the routing functions and their original code moves
into new functions get_all_inheritors() and
get_inheritance_children(). We may choose other names for the functions.
The idea is to create routing functions/macros instead of sprinkling
the code with if () else conditions.
Given Robert's comments [1], it seems that we might have to abandon
the idea to separate the interface for partitioned and non-partitioned
inheritance cases. I'm thinking about the issues and alternatives he
mentioned in his email.
Thanks,
Amit
[1]: /messages/by-id/CA+Tgmobwbh12OJerqAGyPEjb_+2y7T0nqRKTcjed6L4NTET6Fg@mail.gmail.com
On Fri, Aug 4, 2017 at 10:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Aug 4, 2017 at 3:38 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
The current way to expand inherited tables, including partitioned tables,
is to use either find_all_inheritors() or find_inheritance_children()
depending on the context. They return child table OIDs in the (ascending)
order of those OIDs, which means the callers that need to lock the child
tables can do so without worrying about the possibility of deadlock in
some concurrent execution of that piece of code. That's good.
For partitioned tables, there is a possibility of returning child table
(partition) OIDs in the partition bound order, which in addition to
preventing the possibility of deadlocks during concurrent locking, seems
potentially useful for other caller-specific optimizations. For example,
tuple-routing code can utilize that fact to implement binary-search based
partition-searching algorithm. For one more example, refer to the "UPDATE
partition key" thread where it's becoming clear that it would be nice if
the planner had put the partitions in bound order in the ModifyTable that
it creates for UPDATE of partitioned tables [1].
I guess I don't really understand why we want to change the locking
order. That is bound to make expanding the inheritance hierarchy more
expensive. If we use this approach in all cases, it seems to me we're
bound to reintroduce the problem we fixed in commit
c1e0e7e1d790bf18c913e6a452dea811e858b554 and maybe add a few more in
the same vein.
I initially didn't understand this, but I think now I understand it.
Establishing the order of children by partition bounds requires
building the relcache entry right now. That's what is expensive and
would introduce the same problems as the commit you have quoted.
But I don't see that there's any necessary relation
between the order of locking and the order of expansion inside the
relcache entry/plan/whatever else -- so I propose that we keep the
existing locking order and only change the other stuff.
While reading related code this morning, I noticed that
RelationBuildPartitionDesc and RelationGetPartitionDispatchInfo have
*already* changed the locking order for certain operations, because
the PartitionDesc's OID array is bound-ordered not OID-ordered. That
means that when RelationGetPartitionDispatchInfo uses the
PartitionDesc's OID array to figure out what to lock, it's potentially
going to encounter partitions in a different order than would have
been the case if it had used find_all_inheritors directly. I'm
tempted to think that RelationGetPartitionDispatchInfo shouldn't
really be doing locking at all. The best way to have the locking
always happen in the same order is to have only one piece of code that
determines that order - and I vote for find_all_inheritors. Aside
from the fact that it's the way we've always done it (and still do it
in most other places), that code includes infinite-loop defenses which
the new code you've introduced lacks.
+1.
Concretely, my proposal is:
1. Before calling RelationGetPartitionDispatchInfo, the calling code
should use find_all_inheritors to lock all the relevant relations (or
the planner could use find_all_inheritors to get a list of relation
OIDs, store it in the plan in order, and then at execution time we
visit them in that order and lock them).
2. RelationGetPartitionDispatchInfo assumes the relations are already locked.
3. While we're optimizing, in the first loop inside of
RelationGetPartitionDispatchInfo, don't call heap_open(). Instead,
use get_rel_relkind() to see whether we've got a partitioned table; if
so, open it. If not, there's no need.
4. For safety, add a function bool RelationLockHeldByMe(Oid) and add
to this loop a check if (!RelationLockHeldByMe(lfirst_oid(lc1)))
elog(ERROR, ...). Might be interesting to stuff that check into the
relation_open(..., NoLock) path, too.
One objection to this line of attack is that there might be a good
case for locking only the partitioned inheritors first and then going
back and locking the leaf nodes in a second pass, or even only when
required for a particular row. However, that doesn't require putting
everything in bound order - it only requires moving the partitioned
children to the beginning of the list. And I think rather than having
new logic for that, we should teach find_inheritance_children() to do
that directly. I have a feeling Ashutosh is going to cringe at this
suggestion, but my idea is to do this by denormalizing: add a column
to pg_inherits indicating whether the child is of
RELKIND_PARTITIONED_TABLE. Then, when find_inheritance_children scans
pg_inherits, it can pull that flag out for free along with the
relation OID, and qsort() first by the flag and then by the OID. It
can also return the number of initial elements of its return value
which have that flag set.
I am always uncomfortable when we save the same information in two
places without having a way to make sure that they are in sync. That
means we have to add explicit code to keep that information
in sync. Somebody forgetting to add that code wherever
necessary means we have contradictory information persisted in the
database with no idea which copy is correct. Not necessarily
in this case, but usually it is an indication of something going wrong
with the way the schema is designed. Maybe it's better to use your idea
of using get_rel_relkind(), or to find a way to check that the flag is in
sync with the relkind, for example when building the relcache.
Then, in find_all_inheritors, we split rels_list into
partitioned_rels_list and other_rels_list, and process
partitioned_rels_list in its entirety before touching other_rels_list;
they get concatenated at the end.
Now, find_all_inheritors and find_inheritance_children can also grow a
flag bool only_partitioned_children; if set, then we skip the
unpartitioned children entirely.
With all that in place, you can call find_all_inheritors(blah blah,
false) to lock the whole hierarchy, or find_all_inheritors(blah blah,
true) to lock just the partitioned tables in the hierarchy. You get a
consistent lock order either way, and if you start with only the
partitioned tables and later want the leaf partitions too, you just go
through the partitioned children in the order they were returned and
find_inheritance_children(blah blah, false) on each one of them and
the lock order is exactly consistent with what you would have gotten
if you'd done find_all_inheritors(blah blah, false) originally.
Thoughts?
I noticed that find_all_inheritors() uses a hash table to eliminate
duplicates arising out of multiple inheritance. Partition hierarchy is
never going to have multiple inheritance, and doesn't need to
eliminate duplicates and so doesn't need the hash table. It would be
good if we could eliminate that overhead. But that's a separate task from
what this thread is about.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Mon, Aug 7, 2017 at 11:18 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
One objection to this line of attack is that there might be a good
case for locking only the partitioned inheritors first and then going
back and locking the leaf nodes in a second pass, or even only when
required for a particular row. However, that doesn't require putting
everything in bound order - it only requires moving the partitioned
children to the beginning of the list. And I think rather than having
new logic for that, we should teach find_inheritance_children() to do
that directly. I have a feeling Ashutosh is going to cringe at this
suggestion, but my idea is to do this by denormalizing: add a column
to pg_inherits indicating whether the child is of
RELKIND_PARTITIONED_TABLE. Then, when find_inheritance_children scans
pg_inherits, it can pull that flag out for free along with the
relation OID, and qsort() first by the flag and then by the OID. It
can also return the number of initial elements of its return value
which have that flag set.
I am always uncomfortable, when we save the same information in two
places without having a way to make sure that they are in sync. That
means we have to add explicit code to make sure that that information
is kept in sync. Somebody forgetting to add that code wherever
necessary means we have contradictory information persisted in the
databases without an idea of which of them is correct. Not necessarily
in this case, but usually it is an indication of something going wrong
with the way schema is designed. Maybe it's better to use your idea
of using get_rel_relkind() or find a way to check that the flag is in
sync with the relkind, like when building the relcache.
Having said all that, I think we will use this code quite often, so the
performance benefit of replicating the information is worth the
trouble of maintaining code to sync and check the duplicate
information.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 2017/08/05 2:25, Robert Haas wrote:
On Fri, Aug 4, 2017 at 3:38 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
The current way to expand inherited tables, including partitioned tables,
is to use either find_all_inheritors() or find_inheritance_children()
depending on the context. They return child table OIDs in the (ascending)
order of those OIDs, which means the callers that need to lock the child
tables can do so without worrying about the possibility of deadlock in
some concurrent execution of that piece of code. That's good.
For partitioned tables, there is a possibility of returning child table
(partition) OIDs in the partition bound order, which in addition to
preventing the possibility of deadlocks during concurrent locking, seems
potentially useful for other caller-specific optimizations. For example,
tuple-routing code can utilize that fact to implement binary-search based
partition-searching algorithm. For one more example, refer to the "UPDATE
partition key" thread where it's becoming clear that it would be nice if
the planner had put the partitions in bound order in the ModifyTable that
it creates for UPDATE of partitioned tables [1].
I guess I don't really understand why we want to change the locking
order. That is bound to make expanding the inheritance hierarchy more
expensive. If we use this approach in all cases, it seems to me we're
bound to reintroduce the problem we fixed in commit
c1e0e7e1d790bf18c913e6a452dea811e858b554 and maybe add a few more in
the same vein. But I don't see that there's any necessary relation
between the order of locking and the order of expansion inside the
relcache entry/plan/whatever else -- so I propose that we keep the
existing locking order and only change the other stuff.
Hmm, okay.
I guess I was trying to fit one solution to what might be two (or worse,
more) problems of the current implementation, which is not good.
While reading related code this morning, I noticed that
RelationBuildPartitionDesc and RelationGetPartitionDispatchInfo have
*already* changed the locking order for certain operations, because
the PartitionDesc's OID array is bound-ordered not OID-ordered. That
means that when RelationGetPartitionDispatchInfo uses the
PartitionDesc's OID array to figure out what to lock, it's potentially
going to encounter partitions in a different order than would have
been the case if it had used find_all_inheritors directly.
I think Amit Khandekar mentioned this on the UPDATE partition key thread [1].
I'm
tempted to think that RelationGetPartitionDispatchInfo shouldn't
really be doing locking at all. The best way to have the locking
always happen in the same order is to have only one piece of code that
determines that order - and I vote for find_all_inheritors. Aside
from the fact that it's the way we've always done it (and still do it
in most other places), that code includes infinite-loop defenses which
the new code you've introduced lacks.
As long as find_all_inheritors() is a place only to determine the order in
which partitions will be locked, it's fine. My concern is about the time
of actual locking, which in the current planner implementation happens so
early that we end up needlessly locking all the partitions.
(Also in the current implementation, we open all the partitions to
construct Var translation lists, which are actually unused through most of
the planner stages, but admittedly it's a separate issue.)
The locking-partitions-too-soon issue, I think, is an important one and
may need discussing separately, but thought I'd mention it anyway. It
also seems somewhat related to this discussion, but I may be wrong.
Concretely, my proposal is:
1. Before calling RelationGetPartitionDispatchInfo, the calling code
should use find_all_inheritors to lock all the relevant relations (or
the planner could use find_all_inheritors to get a list of relation
OIDs, store it in the plan in order, and then at execution time we
visit them in that order and lock them).
ISTM, we'd want to lock the partitions after we've determined the specific
ones a query needs to scan using the information returned by
RelationGetPartitionDispatchInfo. That means the relations whose cached
partition descriptors it uses to produce its result had better be locked
by then. One way to do that might be to lock all the
tables in the list returned by find_all_inheritors that are partitioned
tables before calling RelationGetPartitionDispatchInfo. It seems that the
approach you've outlined below will let us do that.
BTW, IIUC, there will be two lists of OIDs we'll have: one in the
find_all_inheritors order, say, L1 and the other determined by using
partitioning-specific information for the given query, say L2.
To lock, we iterate L1 and if a given member is in L2, we lock it. It
might be possible to make it as cheap as O(nlogn).
L2 is the order we put leaf partitions into a given plan.
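One way the L1/L2 idea could reach O(n log n) is to sort a copy of L2 once and binary-search it for each member of L1. The standalone C sketch below shows only that; oid_cmp() and lock_order_for_needed() are hypothetical names, and where a real caller would take a lock the sketch just records the OID in out[].

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned int Oid;       /* stand-in for PostgreSQL's Oid */

/* Ascending OID comparator for qsort() and bsearch(). */
static int
oid_cmp(const void *a, const void *b)
{
    Oid oa = *(const Oid *) a;
    Oid ob = *(const Oid *) b;

    return (oa > ob) - (oa < ob);
}

/*
 * Copy into out[] the members of l1 (find_all_inheritors order) that
 * also appear in l2 (the partitions the query needs), keeping l1's
 * order.  out[] must have room for n1 entries.  Returns the number of
 * entries written, or -1 if memory could not be allocated.
 */
static int
lock_order_for_needed(const Oid *l1, size_t n1,
                      const Oid *l2, size_t n2, Oid *out)
{
    Oid    *sorted;
    size_t  i;
    int     nout = 0;

    assert(out != NULL);

    if (n2 == 0)
        return 0;

    sorted = malloc(n2 * sizeof(Oid));
    if (sorted == NULL)
        return -1;
    memcpy(sorted, l2, n2 * sizeof(Oid));
    qsort(sorted, n2, sizeof(Oid), oid_cmp);

    for (i = 0; i < n1; i++)
    {
        /* O(log n2) membership test per member of l1 */
        if (bsearch(&l1[i], sorted, n2, sizeof(Oid), oid_cmp) != NULL)
            out[nout++] = l1[i];    /* a real caller would lock here */
    }

    free(sorted);
    return nout;
}
```

Because the output follows L1, the lock order stays the one find_all_inheritors prescribes no matter how L2 was ordered in the plan.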
2. RelationGetPartitionDispatchInfo assumes the relations are already locked.
3. While we're optimizing, in the first loop inside of
RelationGetPartitionDispatchInfo, don't call heap_open(). Instead,
use get_rel_relkind() to see whether we've got a partitioned table; if
so, open it. If not, there's no need.
That's what the proposed refactoring patch 0002 actually does.
4. For safety, add a function bool RelationLockHeldByMe(Oid) and add
to this loop a check if (!RelationLockHeldByMe(lfirst_oid(lc1)))
elog(ERROR, ...). Might be interesting to stuff that check into the
relation_open(..., NoLock) path, too.
One objection to this line of attack is that there might be a good
case for locking only the partitioned inheritors first and then going
back and locking the leaf nodes in a second pass, or even only when
required for a particular row. However, that doesn't require putting
everything in bound order - it only requires moving the partitioned
children to the beginning of the list. And I think rather than having
new logic for that, we should teach find_inheritance_children() to do
that directly. I have a feeling Ashutosh is going to cringe at this
suggestion, but my idea is to do this by denormalizing: add a column
to pg_inherits indicating whether the child is of
RELKIND_PARTITIONED_TABLE. Then, when find_inheritance_children scans
pg_inherits, it can pull that flag out for free along with the
relation OID, and qsort() first by the flag and then by the OID. It
can also return the number of initial elements of its return value
which have that flag set.
Maybe we can make the initial patch use syscache to get the relkind for a
given child. If the syscache bloat is unbearable, we go with the
denormalization approach.
Then, in find_all_inheritors, we split rels_list into
partitioned_rels_list and other_rels_list, and process
partitioned_rels_list in its entirety before touching other_rels_list;
they get concatenated at the end.
Now, find_all_inheritors and find_inheritance_children can also grow a
flag bool only_partitioned_children; if set, then we skip the
unpartitioned children entirely.
With all that in place, you can call find_all_inheritors(blah blah,
false) to lock the whole hierarchy, or find_all_inheritors(blah blah,
true) to lock just the partitioned tables in the hierarchy. You get a
consistent lock order either way, and if you start with only the
partitioned tables and later want the leaf partitions too, you just go
through the partitioned children in the order they were returned and
find_inheritance_children(blah blah, false) on each one of them and
the lock order is exactly consistent with what you would have gotten
if you'd done find_all_inheritors(blah blah, false) originally.
Thoughts?
So, with this in place:
1. Call find_all_inheritors to lock partitioned tables in the tree in an
order prescribed by OIDs
2. Call RelationGetPartitionDispatchInfo at an opportune time, which will
generate minimal information about the partition tree that it can do
without having to worry about locking anything
3. Determine the list of which leaf partitions will need to be scanned
using the information obtained in 2, if it is possible to do that at all [2]
4. Lock leaf partitions in the find_inheritance_children prescribed order,
but only those that are in the list built in 3.
P.S. While I haven't reviewed 0002 in detail, I think the concept of
minimizing what needs to be built in RelationGetPartitionDispatchInfo
is a very good idea.
Thanks.
Regards,
Amit
[1]: /messages/by-id/CAJ3gD9fdjk2O8aPMXidCeYeB-mFB=wY9ZLfe8cQOfG4bTqVGyQ@mail.gmail.com
[2]: We can do that in set_append_rel_size(), but not in inheritance_planner().
On Mon, Aug 7, 2017 at 1:48 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
with the way schema is designed. Maybe it's better to use your idea
of using get_rel_relkind() or find a way to check that the flag is in
sync with the relkind, like when building the relcache.
That's got the same problem as building a full relcache entry: cache
bloat. It will have *less* cache bloat, but still some. Maybe it's
little enough to be tolerable; not sure. But we want this system to
scale to LOTS of partitions eventually, so building on a design that
we know has scaling problems seems a bit short-sighted.
I noticed that find_all_inheritors() uses a hash table to eliminate
duplicates arising out of multiple inheritance. Partition hierarchy is
never going to have multiple inheritance, and doesn't need to
eliminate duplicates and so doesn't need the hash table. It will be
good, if we can eliminate that overhead. But that's separate task than
what this thread is about.
I don't want to eliminate that overhead. If the catalog is manually
modified or corrupted, the problem could still occur, and result in
backend crashes or, at best, incomprehensible errors. The comments
allude to this problem.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Aug 7, 2017 at 2:54 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
I think Amit Khandekar mentioned this on the UPDATE partition key thread [1].
Yes, similar discussion.
As long as find_all_inheritors() is a place only to determine the order in
which partitions will be locked, it's fine. My concern is about the time
of actual locking, which in the current planner implementation is so early
that we end up needlessly locking all the partitions.
I don't think avoiding that problem is going to be easy. We need a
bunch of per-relation information, like the size of each relation, and
what indexes it has, and how big they are, and the statistics for each
one. It was at one point proposed by someone that every partition
should be required to have the same indexes, but (1) we didn't
implement it like that and (2) if we had done that it wouldn't solve
this problem anyway because the sizes are still going to vary.
Note that I'm not saying this isn't a good problem to solve, just that
it's likely to be a very hard problem to solve.
The locking-partitions-too-soon issue, I think, is an important one and
ISTM, we'd want to lock the partitions after we've determined the specific
ones a query needs to scan using the information returned by
RelationGetPartitionDispatchInfo. That means the latter had better have
locked the relations whose cached partition descriptors will be used to
determine the result that it produces. One way to do that might be to lock
all the tables in the list returned by find_all_inheritors that are
partitioned tables before calling RelationGetPartitionDispatchInfo. It
seems the approach you've outlined below will let us do that.
Yeah, I think so. I think we could possibly open and lock partitioned
children only, then prune away leaf partitions that we can determine
aren't needed, then open and lock the leaf partitions that are needed.
BTW, IIUC, there will be two lists of OIDs we'll have: one in the
find_all_inheritors order, say, L1 and the other determined by using
partitioning-specific information for the given query, say L2.
To lock, we iterate L1 and if a given member is in L2, we lock it. It
might be possible to make it as cheap as O(n log n).
Commonly, we'll prune no partitions or all but one; and we should be
able to make those cases very fast. Other cases can cost a little
more, but I'll certainly complain about anything more than O(n lg n).
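As a rough illustration of the L1/L2 scheme discussed above (not code from any of the patches): sort L2 once, then walk L1 and binary-search each member in L2. That costs O(n log n) overall and yields the surviving OIDs in L1's lock order. The Oid typedef and select_in_lock_order() are hypothetical stand-ins; real code would take a lock on each selected OID (with something like LockRelationOid()) instead of copying it to an output array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef unsigned int Oid;		/* stand-in for PostgreSQL's Oid type */

/* qsort()/bsearch() comparator: ascending OID order */
static int
oid_cmp(const void *a, const void *b)
{
	Oid			x = *(const Oid *) a;
	Oid			y = *(const Oid *) b;

	return (x > y) - (x < y);
}

/*
 * Hypothetical helper: copy into 'out' the members of l1 (the lock-order
 * list) that also appear in l2 (the unpruned partitions), preserving l1's
 * order.  l2 is sorted in place so that each lookup is a binary search.
 * Returns the number of OIDs written to 'out', which must have room for n1.
 */
static size_t
select_in_lock_order(const Oid *l1, size_t n1, Oid *l2, size_t n2, Oid *out)
{
	size_t		nout = 0;
	size_t		i;

	assert(l1 != NULL && l2 != NULL && out != NULL);

	qsort(l2, n2, sizeof(Oid), oid_cmp);

	for (i = 0; i < n1; i++)
	{
		/* A real caller would lock l1[i] here instead of copying it. */
		if (bsearch(&l1[i], l2, n2, sizeof(Oid), oid_cmp) != NULL)
			out[nout++] = l1[i];
	}

	return nout;
}
```

The two common cases mentioned above (nothing pruned, or all but one pruned) could be special-cased ahead of the sort; the sketch does not attempt that.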
3. While we're optimizing, in the first loop inside of
RelationGetPartitionDispatchInfo, don't call heap_open(). Instead,
use get_rel_relkind() to see whether we've got a partitioned table; if
so, open it. If not, there's no need.
That's what the proposed refactoring patch 0002 actually does.
Great.
Maybe, we can make the initial patch use syscache to get the relkind for a
given child. If the syscache bloat is unbearable, we go with the
denormalization approach.
Yeah. Maybe if you write that patch, you can also test it to see how
bad the bloat is.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017/08/07 14:37, Amit Khandekar wrote:
On 4 August 2017 at 22:55, Robert Haas <robertmhaas@gmail.com> wrote:
P.S. While I haven't reviewed 0002 in detail, I think the concept of
minimizing what needs to be built in RelationGetPartitionDispatchInfo
is a very good idea.
True. I also think RelationGetPartitionDispatchInfo() should be
called while preparing the ModifyTable plan; the PartitionDispatch
data structure returned by RelationGetPartitionDispatchInfo() should
be stored in that plan, and then the execution-time fields in
PartitionDispatch would be populated in ExecInitModifyTable().
I'm not sure if we could ever store the PartitionDispatch structure itself
in the plan.
Planner would build and use it to put the leaf partition sub-plans in the
canonical order in the resulting plan (Append, ModifyTable, etc.).
Executor will have to rebuild the PartitionDispatch info again if and when
it needs the same (for example, in ExecSetupPartitionTupleRouting for
insert or update tuple routing).
The refactoring patch that I've proposed (0002) makes PartitionDispatch
structure itself contain a lot less information/state than it currently
does. So RelationGetPartitionDispatchInfo's job per the revised patch is
to reveal the partition tree structure and the information of each
partitioned table that the tree contains. The original design whereby it
builds and puts into PartitionDispatchData things like partition key
execution state (ExprState), TupleTableSlot, TupleConversionMap seems
wrong to me in retrospect and we should somehow revise it. Those things I
mentioned are only needed for tuple-routing, so they should be built and
managed by the executor, not partition.c. Any feedback on the proposed
patch is welcome. :)
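To illustrate what remains in PartitionDispatchData after the refactoring, here is a toy model of the indexes[] convention: a value >= 0 is a position in the leaf partition list, while a negative value -n points at entry n of the partitioned-table list. ToyDispatch and toy_find_leaf() are hypothetical names, and the 'choice' array merely stands in for the bound comparison that real tuple routing performs at each level.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the slimmed-down PartitionDispatchData */
typedef struct ToyDispatch
{
	int			nparts;		/* number of partitions of this table */
	const int  *indexes;	/* one entry per partition, in bound order */
} ToyDispatch;

/*
 * Walk down from the root (pd[0]).  choice[depth] says which partition slot
 * to follow at each level.  Returns the index of the chosen leaf in the
 * global leaf partition list.
 */
static int
toy_find_leaf(const ToyDispatch *pd, const int *choice)
{
	const ToyDispatch *parent = &pd[0];
	int			depth;

	for (depth = 0;; depth++)
	{
		int			slot = choice[depth];
		int			idx;

		assert(slot >= 0 && slot < parent->nparts);
		idx = parent->indexes[slot];

		if (idx >= 0)
			return idx;			/* leaf partition: done */

		parent = &pd[-idx];		/* partitioned table: descend */
	}
}
```

For a root with one sub-partitioned child followed by four leaves, the root's indexes would be {-1, 0, 1, 2, 3} and the child's would continue the leaf numbering, for example {4, 5, 6}.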
Thanks,
Amit
On 2017/08/08 4:34, Robert Haas wrote:
On Mon, Aug 7, 2017 at 2:54 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
As long as find_all_inheritors() is a place only to determine the order in
which partitions will be locked, it's fine. My concern is about the time
of actual locking, which in the current planner implementation is so early
that we end up needlessly locking all the partitions.
I don't think avoiding that problem is going to be easy. We need a
bunch of per-relation information, like the size of each relation, and
what indexes it has, and how big they are, and the statistics for each
one. It was at one point proposed by someone that every partition
should be required to have the same indexes, but (1) we didn't
implement it like that and (2) if we had done that it wouldn't solve
this problem anyway because the sizes are still going to vary.
Sorry, I didn't mean to say we shouldn't lock and open partitions at all.
We do need their relation descriptors for planning and there is no doubt
about that. I was just saying that we should do that only for the
partitions that are not pruned. But, as you say, I can see that the
planner changes required to be able to do that might be hard.
The locking-partitions-too-soon issue, I think, is an important one and
ISTM, we'd want to lock the partitions after we've determined the specific
ones a query needs to scan using the information returned by
RelationGetPartitionDispatchInfo. That means the latter had better have
locked the relations whose cached partition descriptors will be used to
determine the result that it produces. One way to do that might be to lock
all the tables in the list returned by find_all_inheritors that are
partitioned tables before calling RelationGetPartitionDispatchInfo. It
seems the approach you've outlined below will let us do that.
Yeah, I think so. I think we could possibly open and lock partitioned
children only, then prune away leaf partitions that we can determine
aren't needed, then open and lock the leaf partitions that are needed.
Yes.
BTW, IIUC, there will be two lists of OIDs we'll have: one in the
find_all_inheritors order, say, L1 and the other determined by using
partitioning-specific information for the given query, say L2.
To lock, we iterate L1 and if a given member is in L2, we lock it. It
might be possible to make it as cheap as O(n log n).
Commonly, we'll prune no partitions or all but one; and we should be
able to make those cases very fast.
Agreed.
Maybe, we can make the initial patch use syscache to get the relkind for a
given child. If the syscache bloat is unbearable, we go with the
denormalization approach.
Yeah. Maybe if you write that patch, you can also test it to see how
bad the bloat is.
I will try and see, but maybe the syscache solution doesn't get us past
the proof-of-concept stage.
Thanks,
Amit
On 2017/08/05 2:25, Robert Haas wrote:
Concretely, my proposal is:
1. Before calling RelationGetPartitionDispatchInfo, the calling code
should use find_all_inheritors to lock all the relevant relations (or
the planner could use find_all_inheritors to get a list of relation
OIDs, store it in the plan in order, and then at execution time we
visit them in that order and lock them).
2. RelationGetPartitionDispatchInfo assumes the relations are already locked.
3. While we're optimizing, in the first loop inside of
RelationGetPartitionDispatchInfo, don't call heap_open(). Instead,
use get_rel_relkind() to see whether we've got a partitioned table; if
so, open it. If not, there's no need.
4. For safety, add a function bool RelationLockHeldByMe(Oid) and add
to this loop a check if (!RelationLockHeldByMe(lfirst_oid(lc1)))
elog(ERROR, ...). Might be interesting to stuff that check into the
relation_open(..., NoLock) path, too.
One objection to this line of attack is that there might be a good
case for locking only the partitioned inheritors first and then going
back and locking the leaf nodes in a second pass, or even only when
required for a particular row. However, that doesn't require putting
everything in bound order - it only requires moving the partitioned
children to the beginning of the list. And I think rather than having
new logic for that, we should teach find_inheritance_children() to do
that directly. I have a feeling Ashutosh is going to cringe at this
suggestion, but my idea is to do this by denormalizing: add a column
to pg_inherits indicating whether the child is of
RELKIND_PARTITIONED_TABLE. Then, when find_inheritance_children scans
pg_inherits, it can pull that flag out for free along with the
relation OID, and qsort() first by the flag and then by the OID. It
can also return the number of initial elements of its return value
which have that flag set.
Then, in find_all_inheritors, we split rels_list into
partitioned_rels_list and other_rels_list, and process
partitioned_rels_list in its entirety before touching other_rels_list;
they get concatenated at the end.
Now, find_all_inheritors and find_inheritance_children can also grow a
flag bool only_partitioned_children; if set, then we skip the
unpartitioned children entirely.
With all that in place, you can call find_all_inheritors(blah blah,
false) to lock the whole hierarchy, or find_all_inheritors(blah blah,
true) to lock just the partitioned tables in the hierarchy. You get a
consistent lock order either way, and if you start with only the
partitioned tables and later want the leaf partitions too, you just go
through the partitioned children in the order they were returned and
find_inheritance_children(blah blah, false) on each one of them and
the lock order is exactly consistent with what you would have gotten
if you'd done find_all_inheritors(blah blah, false) originally.
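A minimal sketch of the ordering described in this proposal, assuming each pg_inherits row has already been reduced to an (OID, is-partitioned flag) pair: qsort() puts partitioned children first and orders by OID within each group, and the helper reports how many leading entries are partitioned. ChildRel and sort_children() are made-up names for illustration, not part of the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef unsigned int Oid;		/* stand-in for PostgreSQL's Oid type */

/* Hypothetical per-child record, as it might be read from pg_inherits */
typedef struct ChildRel
{
	Oid			relid;
	bool		is_partitioned;	/* the proposed denormalized flag */
} ChildRel;

/* Partitioned children sort first; ties are broken by ascending OID. */
static int
child_rel_cmp(const void *a, const void *b)
{
	const ChildRel *x = (const ChildRel *) a;
	const ChildRel *y = (const ChildRel *) b;

	if (x->is_partitioned != y->is_partitioned)
		return x->is_partitioned ? -1 : 1;

	return (x->relid > y->relid) - (x->relid < y->relid);
}

/*
 * Sort children in place and return the number of initial elements that
 * are partitioned tables.
 */
static size_t
sort_children(ChildRel *children, size_t n)
{
	size_t		num_partitioned = 0;

	qsort(children, n, sizeof(ChildRel), child_rel_cmp);

	while (num_partitioned < n && children[num_partitioned].is_partitioned)
		num_partitioned++;

	return num_partitioned;
}
```

Because the comparator is a total order on (flag, OID), any two backends sorting the same set of children get the same sequence, which is what keeps the lock order consistent.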
I tried implementing this in the attached set of patches.
[PATCH 2/5] Teach pg_inherits.c a bit about partitioning
Both find_inheritance_children and find_all_inheritors now list
partitioned child tables before non-partitioned ones and return
the number of partitioned tables in an optional output argument
[PATCH 3/5] Relieve RelationGetPartitionDispatchInfo() of doing locking
Anyone who wants to call RelationGetPartitionDispatchInfo() must first
acquire locks using find_all_inheritors.
TODO: Add RelationLockHeldByMe() and put if (!RelationLockHeldByMe())
elog(ERROR, ...) check in RelationGetPartitionDispatchInfo()
[PATCH 4/5] Teach expand_inherited_rtentry to use partition bound order
After locking the child tables using find_all_inheritors, we discard
the list of child table OIDs that it generates and rebuild the same
using the information returned by RelationGetPartitionDispatchInfo.
[PATCH 5/5] Store in pg_inherits if a child is a partitioned table
Catalog changes so that is_partitioned property of child tables is now
stored in pg_inherits. This avoids consulting syscache to get that
property as is currently implemented in patch 2/5.
I haven't yet done anything about changing the timing of opening and
locking leaf partitions, because it will require some more thinking about
the required planner changes. But the above set of patches will get us
far enough to make leaf partition sub-plans appear in the partition bound
order (the same order that partition tuple-routing uses in the executor).
With the above patches, we get the desired order of child sub-plans in
Append and ModifyTable plans for partitioned tables:
create table p (a int) partition by range (a);
create table p4 partition of p for values from (30) to (40);
create table p3 partition of p for values from (20) to (30);
create table p2 partition of p for values from (10) to (20);
create table p1 partition of p for values from (1) to (10);
create table p0 partition of p for values from (minvalue) to (1)
  partition by list (a);
create table p00 partition of p0 for values in (0);
create table p01 partition of p0 for values in (-1);
create table p02 partition of p0 for values in (-2);
explain select count(*) from p;
                            QUERY PLAN
-------------------------------------------------------------------
 Aggregate  (cost=293.12..293.13 rows=1 width=8)
   ->  Append  (cost=0.00..248.50 rows=17850 width=0)
         ->  Seq Scan on p1  (cost=0.00..35.50 rows=2550 width=0)
         ->  Seq Scan on p2  (cost=0.00..35.50 rows=2550 width=0)
         ->  Seq Scan on p3  (cost=0.00..35.50 rows=2550 width=0)
         ->  Seq Scan on p4  (cost=0.00..35.50 rows=2550 width=0)
         ->  Seq Scan on p02  (cost=0.00..35.50 rows=2550 width=0)
         ->  Seq Scan on p01  (cost=0.00..35.50 rows=2550 width=0)
         ->  Seq Scan on p00  (cost=0.00..35.50 rows=2550 width=0)
explain update p set a = a;
                          QUERY PLAN
--------------------------------------------------------------
 Update on p  (cost=0.00..248.50 rows=17850 width=10)
   Update on p1
   Update on p2
   Update on p3
   Update on p4
   Update on p02
   Update on p01
   Update on p00
   ->  Seq Scan on p1  (cost=0.00..35.50 rows=2550 width=10)
   ->  Seq Scan on p2  (cost=0.00..35.50 rows=2550 width=10)
   ->  Seq Scan on p3  (cost=0.00..35.50 rows=2550 width=10)
   ->  Seq Scan on p4  (cost=0.00..35.50 rows=2550 width=10)
   ->  Seq Scan on p02  (cost=0.00..35.50 rows=2550 width=10)
   ->  Seq Scan on p01  (cost=0.00..35.50 rows=2550 width=10)
   ->  Seq Scan on p00  (cost=0.00..35.50 rows=2550 width=10)
(15 rows)
P.S. While I haven't reviewed 0002 in detail, I think the concept of
minimizing what needs to be built in RelationGetPartitionDispatchInfo
is a very good idea.
I put this patch ahead in the list and so it's now 0001.
Thanks,
Amit
Attachments:
0001-Decouple-RelationGetPartitionDispatchInfo-from-execu.patch (text/plain; charset=UTF-8)
From f511186bfc3be54ce77b27541695c4c609a877a6 Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Mon, 24 Jul 2017 18:59:57 +0900
Subject: [PATCH 1/5] Decouple RelationGetPartitionDispatchInfo() from executor
Currently it and the structure it generates viz. PartitionDispatch
objects are too coupled with the executor's tuple-routing code. In
particular, it's pretty undesirable that it makes it the responsibility
of the caller to release some resources, such as relcache references
and tuple table slots. That makes it harder to use in places other
than where it's currently being used.
After this refactoring, ExecSetupPartitionTupleRouting() now needs to
do some of the work that was previously done in
RelationGetPartitionDispatchInfo() and get_all_partition_oids() no
longer needs to do some things that it used to.
---
src/backend/catalog/partition.c | 324 +++++++++++++++++----------------
src/backend/commands/copy.c | 35 ++--
src/backend/executor/execMain.c | 158 ++++++++++++++--
src/backend/executor/nodeModifyTable.c | 29 ++-
src/include/catalog/partition.h | 53 ++----
src/include/executor/executor.h | 4 +-
src/include/nodes/execnodes.h | 53 +++++-
7 files changed, 409 insertions(+), 247 deletions(-)
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index dcc7f8af27..3d72d08c35 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -105,6 +105,24 @@ typedef struct PartitionRangeBound
bool lower; /* this is the lower (vs upper) bound */
} PartitionRangeBound;
+/*-----------------------
+ * PartitionDispatchData - information of partitions of one partitioned table
+ * in a partition tree
+ *
+ * partkey Partition key of the table
+ * partdesc Partition descriptor of the table
+ * indexes Array with partdesc->nparts members (for details on what the
+ * individual value represents, see the comments in
+ * RelationGetPartitionDispatchInfo())
+ *-----------------------
+ */
+typedef struct PartitionDispatchData
+{
+ PartitionKey partkey; /* Points into the table's relcache entry */
+ PartitionDesc partdesc; /* Ditto */
+ int *indexes;
+} PartitionDispatchData;
+
static int32 qsort_partition_list_value_cmp(const void *a, const void *b,
void *arg);
static int32 qsort_partition_rbound_cmp(const void *a, const void *b,
@@ -976,178 +994,167 @@ get_partition_qual_relid(Oid relid)
}
/*
- * Append OIDs of rel's partitions to the list 'partoids' and for each OID,
- * append pointer rel to the list 'parents'.
- */
-#define APPEND_REL_PARTITION_OIDS(rel, partoids, parents) \
- do\
- {\
- int i;\
- for (i = 0; i < (rel)->rd_partdesc->nparts; i++)\
- {\
- (partoids) = lappend_oid((partoids), (rel)->rd_partdesc->oids[i]);\
- (parents) = lappend((parents), (rel));\
- }\
- } while(0)
-
-/*
* RelationGetPartitionDispatchInfo
- * Returns information necessary to route tuples down a partition tree
+ * Returns necessary information for each partition in the partition
+ * tree rooted at rel
*
- * All the partitions will be locked with lockmode, unless it is NoLock.
- * A list of the OIDs of all the leaf partitions of rel is returned in
- * *leaf_part_oids.
+ * Information returned includes the following: *ptinfos contains a list of
+ * PartitionedTableInfo objects, one for each partitioned table (with at least
+ * one member, that is, one for the root partitioned table), *leaf_part_oids
+ * contains a list of the OIDs of all the leaf partitions.
+ *
+ * Note that we lock only those partitions that are partitioned tables, because
+ * we need to look at its relcache entry to get its PartitionKey and its
+ * PartitionDesc. It's the caller's responsibility to lock the leaf partitions
+ * that will actually be accessed during a given query.
*/
-PartitionDispatch *
+void
RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
- int *num_parted, List **leaf_part_oids)
+ List **ptinfos, List **leaf_part_oids)
{
- PartitionDispatchData **pd;
- List *all_parts = NIL,
- *all_parents = NIL,
- *parted_rels,
- *parted_rel_parents;
+ List *all_parts,
+ *all_parents;
ListCell *lc1,
*lc2;
int i,
- k,
offset;
/*
- * Lock partitions and make a list of the partitioned ones to prepare
- * their PartitionDispatch objects below.
+ * We rely on the relcache to traverse the partition tree, building
+ * both the leaf partition OIDs list and the PartitionedTableInfo list.
+ * Starting with the root partitioned table for which we already have the
+ * relcache entry, we look at its partition descriptor to get the
+ * partition OIDs. For partitions that are themselves partitioned tables,
+ * we get their relcache entries after locking them with lockmode and
+ * queue their partitions to be looked at later. Leaf partitions are
+ * added to the result list without locking. For each partitioned table,
+ * we build a PartitionedTableInfo object and add it to the other result
+ * list.
*
- * Cannot use find_all_inheritors() here, because then the order of OIDs
- * in parted_rels list would be unknown, which does not help, because we
- * assign indexes within individual PartitionDispatch in an order that is
- * predetermined (determined by the order of OIDs in individual partition
- * descriptors).
+ * Since RelationBuildPartitionDescriptor() puts partitions in a canonical
+ * order determined by comparing partition bounds, we can rely that
+ * concurrent backends see the partitions in the same order, ensuring that
+ * there are no deadlocks when locking the partitions.
*/
- *num_parted = 1;
- parted_rels = list_make1(rel);
- /* Root partitioned table has no parent, so NULL for parent */
- parted_rel_parents = list_make1(NULL);
- APPEND_REL_PARTITION_OIDS(rel, all_parts, all_parents);
+ i = offset = 0;
+ *ptinfos = *leaf_part_oids = NIL;
+
+ /* Start with the root table. */
+ all_parts = list_make1_oid(RelationGetRelid(rel));
+ all_parents = list_make1_oid(InvalidOid);
forboth(lc1, all_parts, lc2, all_parents)
{
- Relation partrel = heap_open(lfirst_oid(lc1), lockmode);
- Relation parent = lfirst(lc2);
- PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
+ Oid partrelid = lfirst_oid(lc1);
+ Oid parentrelid = lfirst_oid(lc2);
- /*
- * If this partition is a partitioned table, add its children to the
- * end of the list, so that they are processed as well.
- */
- if (partdesc)
+ if (get_rel_relkind(partrelid) == RELKIND_PARTITIONED_TABLE)
{
- (*num_parted)++;
- parted_rels = lappend(parted_rels, partrel);
- parted_rel_parents = lappend(parted_rel_parents, parent);
- APPEND_REL_PARTITION_OIDS(partrel, all_parts, all_parents);
- }
- else
- heap_close(partrel, NoLock);
+ int j,
+ k;
+ Relation partrel;
+ PartitionKey partkey;
+ PartitionDesc partdesc;
+ PartitionedTableInfo *ptinfo;
+ PartitionDispatch pd;
+
+ if (partrelid != RelationGetRelid(rel))
+ partrel = heap_open(partrelid, lockmode);
+ else
+ partrel = rel;
- /*
- * We keep the partitioned ones open until we're done using the
- * information being collected here (for example, see
- * ExecEndModifyTable).
- */
- }
+ partkey = RelationGetPartitionKey(partrel);
+ partdesc = RelationGetPartitionDesc(partrel);
+
+ ptinfo = (PartitionedTableInfo *)
+ palloc0(sizeof(PartitionedTableInfo));
+ ptinfo->relid = partrelid;
+ ptinfo->parentid = parentrelid;
+
+ ptinfo->pd = pd = (PartitionDispatchData *)
+ palloc0(sizeof(PartitionDispatchData));
+ pd->partkey = partkey;
- /*
- * We want to create two arrays - one for leaf partitions and another for
- * partitioned tables (including the root table and internal partitions).
- * While we only create the latter here, leaf partition array of suitable
- * objects (such as, ResultRelInfo) is created by the caller using the
- * list of OIDs we return. Indexes into these arrays get assigned in a
- * breadth-first manner, whereby partitions of any given level are placed
- * consecutively in the respective arrays.
- */
- pd = (PartitionDispatchData **) palloc(*num_parted *
- sizeof(PartitionDispatchData *));
- *leaf_part_oids = NIL;
- i = k = offset = 0;
- forboth(lc1, parted_rels, lc2, parted_rel_parents)
- {
- Relation partrel = lfirst(lc1);
- Relation parent = lfirst(lc2);
- PartitionKey partkey = RelationGetPartitionKey(partrel);
- TupleDesc tupdesc = RelationGetDescr(partrel);
- PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
- int j,
- m;
-
- pd[i] = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
- pd[i]->reldesc = partrel;
- pd[i]->key = partkey;
- pd[i]->keystate = NIL;
- pd[i]->partdesc = partdesc;
- if (parent != NULL)
- {
/*
- * For every partitioned table other than root, we must store a
- * tuple table slot initialized with its tuple descriptor and a
- * tuple conversion map to convert a tuple from its parent's
- * rowtype to its own. That is to make sure that we are looking at
- * the correct row using the correct tuple descriptor when
- * computing its partition key for tuple routing.
+ * XXX- do we need a pinning mechanism for partition descriptors
+ * so that their references can be managed independently of
+ * the parent relcache entry? Like PinPartitionDesc(partdesc)?
*/
- pd[i]->tupslot = MakeSingleTupleTableSlot(tupdesc);
- pd[i]->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
- tupdesc,
- gettext_noop("could not convert row type"));
- }
- else
- {
- /* Not required for the root partitioned table */
- pd[i]->tupslot = NULL;
- pd[i]->tupmap = NULL;
- }
- pd[i]->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
+ pd->partdesc = partdesc;
- /*
- * Indexes corresponding to the internal partitions are multiplied by
- * -1 to distinguish them from those of leaf partitions. Encountering
- * an index >= 0 means we found a leaf partition, which is immediately
- * returned as the partition we are looking for. A negative index
- * means we found a partitioned table, whose PartitionDispatch object
- * is located at the above index multiplied back by -1. Using the
- * PartitionDispatch object, search is continued further down the
- * partition tree.
- */
- m = 0;
- for (j = 0; j < partdesc->nparts; j++)
- {
- Oid partrelid = partdesc->oids[j];
+ /*
+ * The values contained in the following array correspond to
+ * indexes of this table's partitions in the global sequence of
+ * all the partitions contained in the partition tree rooted at
+ * rel, traversed in a breadth-first manner. The values should be
+ * such that we will be able to distinguish the leaf partitions
+ * from the non-leaf partitions, because they are returned
+ * to the caller in separate structures from where they will be
+ * accessed. The way that's done is described below:
+ *
+ * Leaf partition OIDs are put into the global leaf_part_oids list,
+ * and for each one, the value stored is its ordinal position in
+ * the list minus 1.
+ *
+ * PartitionedTableInfo objects corresponding to partitions that
+ * are partitioned tables are put into the global ptinfos[] list,
+ * and for each one, the value stored is its ordinal position in
+ * the list multiplied by -1.
+ *
+ * So while looking at the values in the indexes array, if one
+ * gets zero or a positive value, then it's a leaf partition;
+ * otherwise, it's a partitioned table.
+ */
+ pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
- if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
- {
- *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
- pd[i]->indexes[j] = k++;
- }
- else
+ k = 0;
+ for (j = 0; j < partdesc->nparts; j++)
{
+ Oid partrelid = partdesc->oids[j];
+
/*
- * offset denotes the number of partitioned tables of upper
- * levels including those of the current level. Any partition
- * of this table must belong to the next level and hence will
- * be placed after the last partitioned table of this level.
+ * Queue this partition so that it will be processed later
+ * by the outer loop.
*/
- pd[i]->indexes[j] = -(1 + offset + m);
- m++;
+ all_parts = lappend_oid(all_parts, partrelid);
+ all_parents = lappend_oid(all_parents,
+ RelationGetRelid(partrel));
+
+ if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
+ {
+ *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
+ pd->indexes[j] = i++;
+ }
+ else
+ {
+ /*
+ * offset denotes the number of partitioned tables that
+ * we have already processed. k counts the number of
+ * partitions of this table that were found to be
+ * partitioned tables.
+ */
+ pd->indexes[j] = -(1 + offset + k);
+ k++;
+ }
}
- }
- i++;
- /*
- * This counts the number of partitioned tables at upper levels
- * including those of the current level.
- */
- offset += m;
+ offset += k;
+
+ /*
+ * Release the relation descriptor. Lock that we have on the
+ * table will keep the PartitionDesc that is pointing into
+ * RelationData intact, a pointer to which we hope to keep
+ * through this transaction's commit.
+ * (XXX - how true is that?)
+ */
+ if (partrel != rel)
+ heap_close(partrel, NoLock);
+
+ *ptinfos = lappend(*ptinfos, ptinfo);
+ }
}
- return pd;
+ Assert(i == list_length(*leaf_part_oids));
+ Assert((offset + 1) == list_length(*ptinfos));
}
/* Module-local functions */
@@ -1864,7 +1871,7 @@ generate_partition_qual(Relation rel)
* ----------------
*/
void
-FormPartitionKeyDatum(PartitionDispatch pd,
+FormPartitionKeyDatum(PartitionKeyInfo *keyinfo,
TupleTableSlot *slot,
EState *estate,
Datum *values,
@@ -1873,20 +1880,21 @@ FormPartitionKeyDatum(PartitionDispatch pd,
ListCell *partexpr_item;
int i;
- if (pd->key->partexprs != NIL && pd->keystate == NIL)
+ if (keyinfo->key->partexprs != NIL && keyinfo->keystate == NIL)
{
/* Check caller has set up context correctly */
Assert(estate != NULL &&
GetPerTupleExprContext(estate)->ecxt_scantuple == slot);
/* First time through, set up expression evaluation state */
- pd->keystate = ExecPrepareExprList(pd->key->partexprs, estate);
+ keyinfo->keystate = ExecPrepareExprList(keyinfo->key->partexprs,
+ estate);
}
- partexpr_item = list_head(pd->keystate);
- for (i = 0; i < pd->key->partnatts; i++)
+ partexpr_item = list_head(keyinfo->keystate);
+ for (i = 0; i < keyinfo->key->partnatts; i++)
{
- AttrNumber keycol = pd->key->partattrs[i];
+ AttrNumber keycol = keyinfo->key->partattrs[i];
Datum datum;
bool isNull;
@@ -1923,13 +1931,13 @@ FormPartitionKeyDatum(PartitionDispatch pd,
* the latter case.
*/
int
-get_partition_for_tuple(PartitionDispatch *pd,
+get_partition_for_tuple(PartitionTupleRoutingInfo **ptrinfos,
TupleTableSlot *slot,
EState *estate,
- PartitionDispatchData **failed_at,
+ PartitionTupleRoutingInfo **failed_at,
TupleTableSlot **failed_slot)
{
- PartitionDispatch parent;
+ PartitionTupleRoutingInfo *parent;
Datum values[PARTITION_MAX_KEYS];
bool isnull[PARTITION_MAX_KEYS];
int cur_offset,
@@ -1940,11 +1948,11 @@ get_partition_for_tuple(PartitionDispatch *pd,
TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
/* start with the root partitioned table */
- parent = pd[0];
+ parent = ptrinfos[0];
while (true)
{
- PartitionKey key = parent->key;
- PartitionDesc partdesc = parent->partdesc;
+ PartitionKey key = parent->pd->partkey;
+ PartitionDesc partdesc = parent->pd->partdesc;
TupleTableSlot *myslot = parent->tupslot;
TupleConversionMap *map = parent->tupmap;
@@ -1976,7 +1984,7 @@ get_partition_for_tuple(PartitionDispatch *pd,
* So update ecxt_scantuple accordingly.
*/
ecxt->ecxt_scantuple = slot;
- FormPartitionKeyDatum(parent, slot, estate, values, isnull);
+ FormPartitionKeyDatum(parent->keyinfo, slot, estate, values, isnull);
if (key->strategy == PARTITION_STRATEGY_RANGE)
{
@@ -2047,13 +2055,13 @@ get_partition_for_tuple(PartitionDispatch *pd,
*failed_slot = slot;
break;
}
- else if (parent->indexes[cur_index] >= 0)
+ else if (parent->pd->indexes[cur_index] >= 0)
{
- result = parent->indexes[cur_index];
+ result = parent->pd->indexes[cur_index];
break;
}
else
- parent = pd[-parent->indexes[cur_index]];
+ parent = ptrinfos[-parent->pd->indexes[cur_index]];
}
error_exit:
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 53e296559a..b3de3de454 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -165,8 +165,8 @@ typedef struct CopyStateData
bool volatile_defexprs; /* is any of defexprs volatile? */
List *range_table;
- PartitionDispatch *partition_dispatch_info;
- int num_dispatch; /* Number of entries in the above array */
+ PartitionTupleRoutingInfo **ptrinfos;
+ int num_parted; /* Number of entries in the above array */
int num_partitions; /* Number of members in the following arrays */
ResultRelInfo *partitions; /* Per partition result relation */
TupleConversionMap **partition_tupconv_maps;
@@ -1425,7 +1425,7 @@ BeginCopy(ParseState *pstate,
/* Initialize state for CopyFrom tuple routing. */
if (is_from && rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- PartitionDispatch *partition_dispatch_info;
+ PartitionTupleRoutingInfo **ptrinfos;
ResultRelInfo *partitions;
TupleConversionMap **partition_tupconv_maps;
TupleTableSlot *partition_tuple_slot;
@@ -1434,13 +1434,13 @@ BeginCopy(ParseState *pstate,
ExecSetupPartitionTupleRouting(rel,
1,
- &partition_dispatch_info,
+ &ptrinfos,
&partitions,
&partition_tupconv_maps,
&partition_tuple_slot,
&num_parted, &num_partitions);
- cstate->partition_dispatch_info = partition_dispatch_info;
- cstate->num_dispatch = num_parted;
+ cstate->ptrinfos = ptrinfos;
+ cstate->num_parted = num_parted;
cstate->partitions = partitions;
cstate->num_partitions = num_partitions;
cstate->partition_tupconv_maps = partition_tupconv_maps;
@@ -2495,7 +2495,7 @@ CopyFrom(CopyState cstate)
if ((resultRelInfo->ri_TrigDesc != NULL &&
(resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
- cstate->partition_dispatch_info != NULL ||
+ cstate->ptrinfos != NULL ||
cstate->volatile_defexprs)
{
useHeapMultiInsert = false;
@@ -2573,7 +2573,7 @@ CopyFrom(CopyState cstate)
ExecStoreTuple(tuple, slot, InvalidBuffer, false);
/* Determine the partition to heap_insert the tuple into */
- if (cstate->partition_dispatch_info)
+ if (cstate->ptrinfos)
{
int leaf_part_index;
TupleConversionMap *map;
@@ -2587,7 +2587,7 @@ CopyFrom(CopyState cstate)
* partition, respectively.
*/
leaf_part_index = ExecFindPartition(resultRelInfo,
- cstate->partition_dispatch_info,
+ cstate->ptrinfos,
slot,
estate);
Assert(leaf_part_index >= 0 &&
@@ -2818,23 +2818,20 @@ CopyFrom(CopyState cstate)
ExecCloseIndices(resultRelInfo);
- /* Close all the partitioned tables, leaf partitions, and their indices */
- if (cstate->partition_dispatch_info)
+ /* Close all the leaf partitions and their indices */
+ if (cstate->ptrinfos)
{
int i;
/*
- * Remember cstate->partition_dispatch_info[0] corresponds to the root
- * partitioned table, which we must not try to close, because it is
- * the main target table of COPY that will be closed eventually by
- * DoCopy(). Also, tupslot is NULL for the root partitioned table.
+ * cstate->ptrinfos[0] corresponds to the root partitioned table, for
+ * which we didn't create tupslot.
*/
- for (i = 1; i < cstate->num_dispatch; i++)
+ for (i = 1; i < cstate->num_parted; i++)
{
- PartitionDispatch pd = cstate->partition_dispatch_info[i];
+ PartitionTupleRoutingInfo *ptrinfo = cstate->ptrinfos[i];
- heap_close(pd->reldesc, NoLock);
- ExecDropSingleTupleTableSlot(pd->tupslot);
+ ExecDropSingleTupleTableSlot(ptrinfo->tupslot);
}
for (i = 0; i < cstate->num_partitions; i++)
{
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index c11aa4fe21..0379e489d9 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -3214,8 +3214,8 @@ EvalPlanQualEnd(EPQState *epqstate)
* tuple routing for partitioned tables
*
* Output arguments:
- * 'pd' receives an array of PartitionDispatch objects with one entry for
- * every partitioned table in the partition tree
+ * 'ptrinfos' receives an array of PartitionTupleRoutingInfo objects with one
+ * entry for each partitioned table in the partition tree
* 'partitions' receives an array of ResultRelInfo objects with one entry for
* every leaf partition in the partition tree
* 'tup_conv_maps' receives an array of TupleConversionMap objects with one
@@ -3237,7 +3237,7 @@ EvalPlanQualEnd(EPQState *epqstate)
void
ExecSetupPartitionTupleRouting(Relation rel,
Index resultRTindex,
- PartitionDispatch **pd,
+ PartitionTupleRoutingInfo ***ptrinfos,
ResultRelInfo **partitions,
TupleConversionMap ***tup_conv_maps,
TupleTableSlot **partition_tuple_slot,
@@ -3245,13 +3245,135 @@ ExecSetupPartitionTupleRouting(Relation rel,
{
TupleDesc tupDesc = RelationGetDescr(rel);
List *leaf_parts;
+ List *ptinfos = NIL;
ListCell *cell;
int i;
ResultRelInfo *leaf_part_rri;
+ Relation parent;
- /* Get the tuple-routing information and lock partitions */
- *pd = RelationGetPartitionDispatchInfo(rel, RowExclusiveLock, num_parted,
- &leaf_parts);
+ /*
+ * Get information about the partition tree. All the partitioned
+ * tables in the tree are locked, but not the leaf partitions. We
+ * lock them while building their ResultRelInfos below.
+ */
+ RelationGetPartitionDispatchInfo(rel, RowExclusiveLock,
+ &ptinfos, &leaf_parts);
+
+ /*
+ * The ptinfos list contains PartitionedTableInfo objects for all the
+ * partitioned tables in the partition tree. Using the information
+ * therein, we construct an array of PartitionTupleRoutingInfo objects
+ * to be used during tuple-routing.
+ */
+ *num_parted = list_length(ptinfos);
+ *ptrinfos = (PartitionTupleRoutingInfo **) palloc0(*num_parted *
+ sizeof(PartitionTupleRoutingInfo *));
+ /*
+ * Free the ptinfos List structure itself as we go through (open-coded
+ * list_free).
+ */
+ i = 0;
+ cell = list_head(ptinfos);
+ parent = NULL;
+ while (cell)
+ {
+ ListCell *tmp = cell;
+ PartitionedTableInfo *ptinfo = lfirst(tmp),
+ *next_ptinfo;
+ Relation partrel;
+ PartitionTupleRoutingInfo *ptrinfo;
+
+ if (lnext(tmp))
+ next_ptinfo = lfirst(lnext(tmp));
+
+ /* As mentioned above, the partitioned tables have been locked. */
+ if (ptinfo->relid != RelationGetRelid(rel))
+ partrel = heap_open(ptinfo->relid, NoLock);
+ else
+ partrel = rel;
+
+ ptrinfo = (PartitionTupleRoutingInfo *)
+ palloc0(sizeof(PartitionTupleRoutingInfo));
+ ptrinfo->relid = ptinfo->relid;
+
+ /* Stash a reference to this PartitionDispatch. */
+ ptrinfo->pd = ptinfo->pd;
+
+ /* State for extracting partition key from tuples will go here. */
+ ptrinfo->keyinfo = (PartitionKeyInfo *)
+ palloc0(sizeof(PartitionKeyInfo));
+ ptrinfo->keyinfo->key = RelationGetPartitionKey(partrel);
+ ptrinfo->keyinfo->keystate = NIL;
+
+ /*
+ * For every partitioned table other than root, we must store a tuple
+ * table slot initialized with its tuple descriptor and a tuple
+ * conversion map to convert a tuple from its parent's rowtype to its
+ * own. That is to make sure that we are looking at the correct row
+ * using the correct tuple descriptor when computing its partition key
+ * for tuple routing.
+ */
+ if (ptinfo->parentid != InvalidOid)
+ {
+ TupleDesc tupdesc = RelationGetDescr(partrel);
+
+ /* Open the parent relation descriptor if not already done. */
+ if (ptinfo->parentid == RelationGetRelid(rel))
+ {
+ parent = rel;
+ }
+ else if (parent == NULL)
+ {
+ /* Locked by RelationGetPartitionDispatchInfo(). */
+ parent = heap_open(ptinfo->parentid, NoLock);
+ }
+
+ ptrinfo->tupslot = MakeSingleTupleTableSlot(tupdesc);
+ ptrinfo->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
+ tupdesc,
+ gettext_noop("could not convert row type"));
+
+ /*
+ * Close the parent descriptor, if the next partitioned table in
+ * the list is not a sibling, because it will have a different
+ * parent if so.
+ */
+ if (parent && parent != rel && (lnext(tmp) == NULL ||
+ next_ptinfo->parentid != ptinfo->parentid))
+ {
+ heap_close(parent, NoLock);
+ parent = NULL;
+ }
+
+ /*
+ * Release the relation descriptor. The lock that we have on the
+ * table will keep the PartitionDesc that is pointing into
+ * RelationData intact, a pointer to which we hope to keep
+ * through this transaction's commit.
+ * (XXX - how true is that?)
+ */
+ if (partrel != rel)
+ heap_close(partrel, NoLock);
+ }
+ else
+ {
+ /* Not required for the root partitioned table */
+ ptrinfo->tupslot = NULL;
+ ptrinfo->tupmap = NULL;
+ }
+
+ (*ptrinfos)[i++] = ptrinfo;
+
+ /* Free the ListCell. */
+ cell = lnext(cell);
+ pfree(tmp);
+ }
+
+ /* Free the List itself. */
+ if (ptinfos)
+ pfree(ptinfos);
+
+ /* For leaf partitions, we build ResultRelInfos and TupleConversionMaps. */
*num_partitions = list_length(leaf_parts);
*partitions = (ResultRelInfo *) palloc(*num_partitions *
sizeof(ResultRelInfo));
@@ -3274,11 +3396,11 @@ ExecSetupPartitionTupleRouting(Relation rel,
TupleDesc part_tupdesc;
/*
- * We locked all the partitions above including the leaf partitions.
- * Note that each of the relations in *partitions are eventually
- * closed by the caller.
+ * RelationGetPartitionDispatchInfo didn't lock the leaf partitions,
+ * so lock them here. Note that each of the relations in *partitions is
+ * eventually closed (when the plan is shut down, for instance).
*/
- partrel = heap_open(lfirst_oid(cell), NoLock);
+ partrel = heap_open(lfirst_oid(cell), RowExclusiveLock);
part_tupdesc = RelationGetDescr(partrel);
/*
@@ -3291,7 +3413,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
* partition from the parent's type to the partition's.
*/
(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
- gettext_noop("could not convert row type"));
+ gettext_noop("could not convert row type"));
InitResultRelInfo(leaf_part_rri,
partrel,
@@ -3325,11 +3447,13 @@ ExecSetupPartitionTupleRouting(Relation rel,
* by get_partition_for_tuple() unchanged.
*/
int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
- TupleTableSlot *slot, EState *estate)
+ExecFindPartition(ResultRelInfo *resultRelInfo,
+ PartitionTupleRoutingInfo **ptrinfos,
+ TupleTableSlot *slot,
+ EState *estate)
{
int result;
- PartitionDispatchData *failed_at;
+ PartitionTupleRoutingInfo *failed_at;
TupleTableSlot *failed_slot;
/*
@@ -3339,7 +3463,7 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
if (resultRelInfo->ri_PartitionCheck)
ExecPartitionCheck(resultRelInfo, slot, estate);
- result = get_partition_for_tuple(pd, slot, estate,
+ result = get_partition_for_tuple(ptrinfos, slot, estate,
&failed_at, &failed_slot);
if (result < 0)
{
@@ -3349,9 +3473,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
char *val_desc;
ExprContext *ecxt = GetPerTupleExprContext(estate);
- failed_rel = failed_at->reldesc;
+ failed_rel = heap_open(failed_at->relid, NoLock);
ecxt->ecxt_scantuple = failed_slot;
- FormPartitionKeyDatum(failed_at, failed_slot, estate,
+ FormPartitionKeyDatum(failed_at->keyinfo, failed_slot, estate,
key_values, key_isnull);
val_desc = ExecBuildSlotPartitionKeyDescription(failed_rel,
key_values,
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 30add8e3c7..00cbee4fb6 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -277,7 +277,7 @@ ExecInsert(ModifyTableState *mtstate,
resultRelInfo = estate->es_result_relation_info;
/* Determine the partition to heap_insert the tuple into */
- if (mtstate->mt_partition_dispatch_info)
+ if (mtstate->mt_ptrinfos)
{
int leaf_part_index;
TupleConversionMap *map;
@@ -291,7 +291,7 @@ ExecInsert(ModifyTableState *mtstate,
* respectively.
*/
leaf_part_index = ExecFindPartition(resultRelInfo,
- mtstate->mt_partition_dispatch_info,
+ mtstate->mt_ptrinfos,
slot,
estate);
Assert(leaf_part_index >= 0 &&
@@ -1486,7 +1486,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
int numResultRelInfos;
/* Find the set of partitions so that we can find their TupleDescs. */
- if (mtstate->mt_partition_dispatch_info != NULL)
+ if (mtstate->mt_ptrinfos != NULL)
{
/*
* For INSERT via partitioned table, so we need TupleDescs based
@@ -1910,7 +1910,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
if (operation == CMD_INSERT &&
rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- PartitionDispatch *partition_dispatch_info;
+ PartitionTupleRoutingInfo **ptrinfos;
ResultRelInfo *partitions;
TupleConversionMap **partition_tupconv_maps;
TupleTableSlot *partition_tuple_slot;
@@ -1919,13 +1919,13 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
ExecSetupPartitionTupleRouting(rel,
node->nominalRelation,
- &partition_dispatch_info,
+ &ptrinfos,
&partitions,
&partition_tupconv_maps,
&partition_tuple_slot,
&num_parted, &num_partitions);
- mtstate->mt_partition_dispatch_info = partition_dispatch_info;
- mtstate->mt_num_dispatch = num_parted;
+ mtstate->mt_ptrinfos = ptrinfos;
+ mtstate->mt_num_parted = num_parted;
mtstate->mt_partitions = partitions;
mtstate->mt_num_partitions = num_partitions;
mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
@@ -2335,19 +2335,16 @@ ExecEndModifyTable(ModifyTableState *node)
}
/*
- * Close all the partitioned tables, leaf partitions, and their indices
+ * Close all the leaf partitions and their indices.
*
- * Remember node->mt_partition_dispatch_info[0] corresponds to the root
- * partitioned table, which we must not try to close, because it is the
- * main target table of the query that will be closed by ExecEndPlan().
- * Also, tupslot is NULL for the root partitioned table.
+ * node->mt_ptrinfos[0] corresponds to the root partitioned table, for
+ * which we didn't create tupslot.
*/
- for (i = 1; i < node->mt_num_dispatch; i++)
+ for (i = 1; i < node->mt_num_parted; i++)
{
- PartitionDispatch pd = node->mt_partition_dispatch_info[i];
+ PartitionTupleRoutingInfo *ptrinfo = node->mt_ptrinfos[i];
- heap_close(pd->reldesc, NoLock);
- ExecDropSingleTupleTableSlot(pd->tupslot);
+ ExecDropSingleTupleTableSlot(ptrinfo->tupslot);
}
for (i = 0; i < node->mt_num_partitions; i++)
{
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 434ded37d7..6a0c81b3bd 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -39,36 +39,23 @@ typedef struct PartitionDescData
typedef struct PartitionDescData *PartitionDesc;
-/*-----------------------
- * PartitionDispatch - information about one partitioned table in a partition
- * hierarchy required to route a tuple to one of its partitions
- *
- * reldesc Relation descriptor of the table
- * key Partition key information of the table
- * keystate Execution state required for expressions in the partition key
- * partdesc Partition descriptor of the table
- * tupslot A standalone TupleTableSlot initialized with this table's tuple
- * descriptor
- * tupmap TupleConversionMap to convert from the parent's rowtype to
- * this table's rowtype (when extracting the partition key of a
- * tuple just before routing it through this table)
- * indexes Array with partdesc->nparts members (for details on what
- * individual members represent, see how they are set in
- * RelationGetPartitionDispatchInfo())
- *-----------------------
+typedef struct PartitionDispatchData *PartitionDispatch;
+
+/*
+ * Information about one partitioned table in a given partition tree
*/
-typedef struct PartitionDispatchData
+typedef struct PartitionedTableInfo
{
- Relation reldesc;
- PartitionKey key;
- List *keystate; /* list of ExprState */
- PartitionDesc partdesc;
- TupleTableSlot *tupslot;
- TupleConversionMap *tupmap;
- int *indexes;
-} PartitionDispatchData;
+ Oid relid;
+ Oid parentid;
-typedef struct PartitionDispatchData *PartitionDispatch;
+ /*
+ * This contains information about bounds of the partitions of this
+ * table and about where individual partitions are placed in the global
+ * partition tree.
+ */
+ PartitionDispatch pd;
+} PartitionedTableInfo;
extern void RelationBuildPartitionDesc(Relation relation);
extern bool partition_bounds_equal(PartitionKey key,
@@ -85,18 +72,18 @@ extern List *map_partition_varattnos(List *expr, int target_varno,
extern List *RelationGetPartitionQual(Relation rel);
extern Expr *get_partition_qual_relid(Oid relid);
+extern void RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
+ List **ptinfos, List **leaf_part_oids);
+
/* For tuple routing */
-extern PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
- int lockmode, int *num_parted,
- List **leaf_part_oids);
-extern void FormPartitionKeyDatum(PartitionDispatch pd,
+extern void FormPartitionKeyDatum(PartitionKeyInfo *keyinfo,
TupleTableSlot *slot,
EState *estate,
Datum *values,
bool *isnull);
-extern int get_partition_for_tuple(PartitionDispatch *pd,
+extern int get_partition_for_tuple(PartitionTupleRoutingInfo **ptrinfos,
TupleTableSlot *slot,
EState *estate,
- PartitionDispatchData **failed_at,
+ PartitionTupleRoutingInfo **failed_at,
TupleTableSlot **failed_slot);
#endif /* PARTITION_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 60326f9d03..6e1d3a6d2f 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -208,13 +208,13 @@ extern void EvalPlanQualSetTuple(EPQState *epqstate, Index rti,
extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti);
extern void ExecSetupPartitionTupleRouting(Relation rel,
Index resultRTindex,
- PartitionDispatch **pd,
+ PartitionTupleRoutingInfo ***ptrinfos,
ResultRelInfo **partitions,
TupleConversionMap ***tup_conv_maps,
TupleTableSlot **partition_tuple_slot,
int *num_parted, int *num_partitions);
extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
- PartitionDispatch *pd,
+ PartitionTupleRoutingInfo **ptrinfos,
TupleTableSlot *slot,
EState *estate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 35c28a6143..1514d62f52 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -414,6 +414,55 @@ typedef struct ResultRelInfo
Relation ri_PartitionRoot;
} ResultRelInfo;
+/* Forward declarations, to avoid including other headers */
+typedef struct PartitionKeyData *PartitionKey;
+typedef struct PartitionDispatchData *PartitionDispatch;
+
+/*
+ * PartitionKeyInfo - execution state for the partition key of a
+ * partitioned table
+ *
+ * keystate is the execution state required for expressions contained in the
+ * partition key. It is NIL until initialized by FormPartitionKeyDatum() if
+ * and when it is called; for example, during tuple routing through a given
+ * partitioned table.
+ */
+typedef struct PartitionKeyInfo
+{
+ PartitionKey key; /* Points into the table's relcache entry */
+ List *keystate;
+} PartitionKeyInfo;
+
+/*
+ * PartitionTupleRoutingInfo - information required for tuple-routing
+ * through one partitioned table in a partition
+ * tree
+ */
+typedef struct PartitionTupleRoutingInfo
+{
+ /* OID of the table */
+ Oid relid;
+
+ /* Information about the table's partitions */
+ PartitionDispatch pd;
+
+ /* See comment above the definition of PartitionKeyInfo */
+ PartitionKeyInfo *keyinfo;
+
+ /*
+ * A standalone TupleTableSlot initialized with this table's tuple
+ * descriptor
+ */
+ TupleTableSlot *tupslot;
+
+ /*
+ * TupleConversionMap to convert from the parent's rowtype to this table's
+ * rowtype (when extracting the partition key of a tuple just before
+ * routing it through this table)
+ */
+ TupleConversionMap *tupmap;
+} PartitionTupleRoutingInfo;
+
/* ----------------
* EState information
*
@@ -970,9 +1019,9 @@ typedef struct ModifyTableState
TupleTableSlot *mt_existing; /* slot to store existing target tuple in */
List *mt_excludedtlist; /* the excluded pseudo relation's tlist */
TupleTableSlot *mt_conflproj; /* CONFLICT ... SET ... projection target */
- struct PartitionDispatchData **mt_partition_dispatch_info;
/* Tuple-routing support info */
- int mt_num_dispatch; /* Number of entries in the above array */
+ struct PartitionTupleRoutingInfo **mt_ptrinfos;
+ int mt_num_parted; /* Number of entries in the above array */
int mt_num_partitions; /* Number of members in the following
* arrays */
ResultRelInfo *mt_partitions; /* Per partition result relation */
--
2.11.0
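
The loop in get_partition_for_tuple() above depends on the sign convention
of the indexes array: a non-negative entry is a leaf partition index, and a
negative entry is the negated position of a sub-partitioned table in the
ptrinfos array. Below is a minimal standalone sketch of that convention only.
It is not code from the patch: DemoDispatch, demo_route() and the demo_*
arrays are hypothetical names, and the bound search is reduced to integer
range partitions with exclusive upper bounds and no lower bound check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one partitioned table's dispatch information. */
typedef struct DemoDispatch
{
    int         nparts;     /* number of partitions of this table */
    const int  *upper;      /* exclusive upper bound of each partition */
    const int  *indexes;    /* >= 0: leaf partition index;
                             * <  0: negated index of a partitioned table
                             *       in the dispatch array */
} DemoDispatch;

/* Assumed demo tree: root has a leaf [.., 100) and a sub-partitioned
 * child [100, 200); that child has leaves [.., 150) and [150, 200). */
static const int demo_root_upper[] = {100, 200};
static const int demo_root_indexes[] = {0, -1};
static const int demo_sub_upper[] = {150, 200};
static const int demo_sub_indexes[] = {1, 2};

static const DemoDispatch demo_tables[] = {
    {2, demo_root_upper, demo_root_indexes},    /* [0] is always the root */
    {2, demo_sub_upper, demo_sub_indexes}
};

/* Return the leaf partition index for key, or -1 if no partition accepts it. */
static int
demo_route(const DemoDispatch *tables, int key)
{
    const DemoDispatch *parent;

    assert(tables != NULL);
    parent = &tables[0];        /* start with the root partitioned table */

    for (;;)
    {
        int         cur_index = -1;
        int         i;

        /* Linear search for brevity; bound order would allow binary search. */
        for (i = 0; i < parent->nparts; i++)
        {
            if (key < parent->upper[i])
            {
                cur_index = i;
                break;
            }
        }

        if (cur_index < 0)
            return -1;          /* no partition of this table accepts key */
        if (parent->indexes[cur_index] >= 0)
            return parent->indexes[cur_index];  /* reached a leaf */

        /* Negative entry: descend into the sub-partitioned table. */
        parent = &tables[-parent->indexes[cur_index]];
    }
}
```

With the demo tree, demo_route(demo_tables, 120) descends from the root into
demo_tables[1] and returns leaf index 1. Position 0 cannot be encoded as a
negative entry, which is harmless because position 0 is the root and the root
is never anyone's child.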
0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch
From b7ec1ddc2e26e75e0ab092c36461c09e9ca0a9d8 Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Tue, 8 Aug 2017 18:42:30 +0900
Subject: [PATCH 2/5] Teach pg_inherits.c a bit about partitioning
Both find_inheritance_children and find_all_inheritors now list
partitioned child tables before non-partitioned ones and return
the number of partitioned tables in an optional output argument.
---
contrib/sepgsql/dml.c | 2 +-
src/backend/catalog/partition.c | 2 +-
src/backend/catalog/pg_inherits.c | 157 ++++++++++++++++++++++++++-------
src/backend/commands/analyze.c | 3 +-
src/backend/commands/lockcmds.c | 2 +-
src/backend/commands/publicationcmds.c | 2 +-
src/backend/commands/tablecmds.c | 39 ++++----
src/backend/commands/vacuum.c | 3 +-
src/backend/optimizer/prep/prepunion.c | 2 +-
src/include/catalog/pg_inherits_fn.h | 5 +-
10 files changed, 162 insertions(+), 55 deletions(-)
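
The ordering that the commit message describes (partitioned children first,
ascending OID within each group) can be tried outside the backend. The sketch
below is an illustration under stated assumptions, not the patch's code:
DemoOid, DemoChild, demo_child_cmp() and demo_sort_children() are hypothetical
names, and is_partitioned is supplied by the caller instead of being looked up
from the catalog.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef unsigned int DemoOid;   /* stand-in for PostgreSQL's Oid */

/* Hypothetical analogue of the per-child record being sorted. */
typedef struct DemoChild
{
    DemoOid     relid;
    bool        is_partitioned;
} DemoChild;

/* qsort comparator: partitioned children first, then ascending OID. */
static int
demo_child_cmp(const void *p1, const void *p2)
{
    const DemoChild *c1 = (const DemoChild *) p1;
    const DemoChild *c2 = (const DemoChild *) p2;

    if (c1->is_partitioned != c2->is_partitioned)
        return c1->is_partitioned ? -1 : 1;
    if (c1->relid < c2->relid)
        return -1;
    if (c1->relid > c2->relid)
        return 1;
    return 0;
}

/*
 * Sort children in place and return how many partitioned children now sit
 * at the head of the array.
 */
static int
demo_sort_children(DemoChild *children, int n)
{
    int         npartitioned = 0;
    int         i;

    assert(n >= 0);
    for (i = 0; i < n; i++)
    {
        if (children[i].is_partitioned)
            npartitioned++;
    }
    if (n > 1)
        qsort(children, (size_t) n, sizeof(DemoChild), demo_child_cmp);
    return npartitioned;
}
```

Because the order is still a deterministic function of the set of children,
every backend that locks children in this order locks them in the same order,
so the deadlock-avoidance argument behind plain OID order carries over.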
diff --git a/contrib/sepgsql/dml.c b/contrib/sepgsql/dml.c
index b643720e36..6fc279805c 100644
--- a/contrib/sepgsql/dml.c
+++ b/contrib/sepgsql/dml.c
@@ -333,7 +333,7 @@ sepgsql_dml_privileges(List *rangeTabls, bool abort_on_violation)
if (!rte->inh)
tableIds = list_make1_oid(rte->relid);
else
- tableIds = find_all_inheritors(rte->relid, NoLock, NULL);
+ tableIds = find_all_inheritors(rte->relid, NoLock, NULL, NULL);
foreach(li, tableIds)
{
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 3d72d08c35..465e4fc097 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -196,7 +196,7 @@ RelationBuildPartitionDesc(Relation rel)
return;
/* Get partition oids from pg_inherits */
- inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
+ inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock, NULL);
/* Collect bound spec nodes in a list */
i = 0;
diff --git a/src/backend/catalog/pg_inherits.c b/src/backend/catalog/pg_inherits.c
index 245a374fc9..99b1e70de6 100644
--- a/src/backend/catalog/pg_inherits.c
+++ b/src/backend/catalog/pg_inherits.c
@@ -30,9 +30,12 @@
#include "utils/builtins.h"
#include "utils/fmgroids.h"
#include "utils/memutils.h"
+#include "utils/lsyscache.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+static int32 inhchildinfo_cmp(const void *p1, const void *p2);
+
/*
* Entry of a hash table used in find_all_inheritors. See below.
*/
@@ -42,6 +45,30 @@ typedef struct SeenRelsEntry
ListCell *numparents_cell; /* corresponding list cell */
} SeenRelsEntry;
+/* Information about one inheritance child table. */
+typedef struct InhChildInfo
+{
+ Oid relid;
+ bool is_partitioned;
+} InhChildInfo;
+
+#define OID_CMP(o1, o2) \
+ ((o1) < (o2) ? -1 : ((o1) > (o2) ? 1 : 0))
+
+static int32
+inhchildinfo_cmp(const void *p1, const void *p2)
+{
+ InhChildInfo c1 = *((const InhChildInfo *) p1);
+ InhChildInfo c2 = *((const InhChildInfo *) p2);
+
+ if (c1.is_partitioned && !c2.is_partitioned)
+ return -1;
+ if (!c1.is_partitioned && c2.is_partitioned)
+ return 1;
+
+ return OID_CMP(c1.relid, c2.relid);
+}
+
/*
* find_inheritance_children
*
@@ -54,7 +81,8 @@ typedef struct SeenRelsEntry
* against possible DROPs of child relations.
*/
List *
-find_inheritance_children(Oid parentrelId, LOCKMODE lockmode)
+find_inheritance_children(Oid parentrelId, LOCKMODE lockmode,
+ int *num_partitioned_children)
{
List *list = NIL;
Relation relation;
@@ -62,9 +90,10 @@ find_inheritance_children(Oid parentrelId, LOCKMODE lockmode)
ScanKeyData key[1];
HeapTuple inheritsTuple;
Oid inhrelid;
- Oid *oidarr;
- int maxoids,
- numoids,
+ InhChildInfo *inhchildren;
+ int maxchildren,
+ numchildren,
+ my_num_partitioned_children,
i;
/*
@@ -77,9 +106,10 @@ find_inheritance_children(Oid parentrelId, LOCKMODE lockmode)
/*
* Scan pg_inherits and build a working array of subclass OIDs.
*/
- maxoids = 32;
- oidarr = (Oid *) palloc(maxoids * sizeof(Oid));
- numoids = 0;
+ maxchildren = 32;
+ inhchildren = (InhChildInfo *) palloc(maxchildren * sizeof(InhChildInfo));
+ numchildren = 0;
+ my_num_partitioned_children = 0;
relation = heap_open(InheritsRelationId, AccessShareLock);
@@ -94,33 +124,47 @@ find_inheritance_children(Oid parentrelId, LOCKMODE lockmode)
while ((inheritsTuple = systable_getnext(scan)) != NULL)
{
inhrelid = ((Form_pg_inherits) GETSTRUCT(inheritsTuple))->inhrelid;
- if (numoids >= maxoids)
+ if (numchildren >= maxchildren)
+ {
+ maxchildren *= 2;
+ inhchildren = (InhChildInfo *) repalloc(inhchildren,
+ maxchildren * sizeof(InhChildInfo));
+ }
+ inhchildren[numchildren].relid = inhrelid;
+
+ if (get_rel_relkind(inhrelid) == RELKIND_PARTITIONED_TABLE)
{
- maxoids *= 2;
- oidarr = (Oid *) repalloc(oidarr, maxoids * sizeof(Oid));
+ inhchildren[numchildren].is_partitioned = true;
+ my_num_partitioned_children++;
}
- oidarr[numoids++] = inhrelid;
+ else
+ inhchildren[numchildren].is_partitioned = false;
+ numchildren++;
}
systable_endscan(scan);
heap_close(relation, AccessShareLock);
+ if (num_partitioned_children)
+ *num_partitioned_children = my_num_partitioned_children;
+
/*
* If we found more than one child, sort them by OID. This ensures
* reasonably consistent behavior regardless of the vagaries of an
* indexscan. This is important since we need to be sure all backends
* lock children in the same order to avoid needless deadlocks.
*/
- if (numoids > 1)
- qsort(oidarr, numoids, sizeof(Oid), oid_cmp);
+ if (numchildren > 1)
+ qsort(inhchildren, numchildren, sizeof(InhChildInfo),
+ inhchildinfo_cmp);
/*
* Acquire locks and build the result list.
*/
- for (i = 0; i < numoids; i++)
+ for (i = 0; i < numchildren; i++)
{
- inhrelid = oidarr[i];
+ inhrelid = inhchildren[i].relid;
if (lockmode != NoLock)
{
@@ -144,7 +188,7 @@ find_inheritance_children(Oid parentrelId, LOCKMODE lockmode)
list = lappend_oid(list, inhrelid);
}
- pfree(oidarr);
+ pfree(inhchildren);
return list;
}
@@ -159,18 +203,28 @@ find_inheritance_children(Oid parentrelId, LOCKMODE lockmode)
* given rel.
*
* The specified lock type is acquired on all child relations (but not on the
- * given rel; caller should already have locked it). If lockmode is NoLock
- * then no locks are acquired, but caller must beware of race conditions
- * against possible DROPs of child relations.
+ * given rel; caller should already have locked it), unless
+ * lock_only_partitioned_children is specified, in which case, only the
+ * child relations that are partitioned tables are locked. If lockmode is
+ * NoLock then no locks are acquired, but caller must beware of race
+ * conditions against possible DROPs of child relations.
+ *
+ * Returned list of OIDs is such that all the partitioned tables in the tree
+ * appear at the head of the list. If num_partitioned_children is non-NULL,
+ * *num_partitioned_children returns the number of partitioned child table
+ * OIDs at the head of the list.
*/
List *
-find_all_inheritors(Oid parentrelId, LOCKMODE lockmode, List **numparents)
+find_all_inheritors(Oid parentrelId, LOCKMODE lockmode,
+ List **numparents, int *num_partitioned_children)
{
/* hash table for O(1) rel_oid -> rel_numparents cell lookup */
HTAB *seen_rels;
HASHCTL ctl;
List *rels_list,
- *rel_numparents;
+ *rel_numparents,
+ *partitioned_rels_list,
+ *other_rels_list;
ListCell *l;
memset(&ctl, 0, sizeof(ctl));
@@ -185,31 +239,71 @@ find_all_inheritors(Oid parentrelId, LOCKMODE lockmode, List **numparents)
/*
* We build a list starting with the given rel and adding all direct and
- * indirect children. We can use a single list as both the record of
- * already-found rels and the agenda of rels yet to be scanned for more
- * children. This is a bit tricky but works because the foreach() macro
- * doesn't fetch the next list element until the bottom of the loop.
+ * indirect children. We can use a single list (rels_list) as both the
+ * record of already-found rels and the agenda of rels yet to be scanned
+ * for more children. This is a bit tricky but works because the foreach()
+ * macro doesn't fetch the next list element until the bottom of the loop.
+ *
+ * partitioned_rels_list will contain the OIDs of the partitioned child
+ * tables and other_rels_list will contain the OIDs of the non-partitioned
+ * child tables. Result list will be generated by concatenating the two
+ * lists together with partitioned_rels_list appearing first.
*/
rels_list = list_make1_oid(parentrelId);
+ partitioned_rels_list = list_make1_oid(parentrelId);
+ other_rels_list = NIL;
rel_numparents = list_make1_int(0);
+ if (num_partitioned_children)
+ *num_partitioned_children = 0;
+
foreach(l, rels_list)
{
Oid currentrel = lfirst_oid(l);
List *currentchildren;
- ListCell *lc;
+ ListCell *lc,
+ *first_nonpartitioned_child;
+ int cur_num_partitioned_children = 0,
+ i;
/* Get the direct children of this rel */
- currentchildren = find_inheritance_children(currentrel, lockmode);
+ currentchildren = find_inheritance_children(currentrel, lockmode,
+ &cur_num_partitioned_children);
+
+ if (num_partitioned_children)
+ *num_partitioned_children += cur_num_partitioned_children;
+
+ /*
+ * Append partitioned children to rels_list and partitioned_rels_list.
+ * We know for sure that partitioned children don't need the
+ * de-duplication logic in the following loop, because partitioned
+ * tables are not allowed to participate in multiple inheritance.
+ */
+ i = 0;
+ foreach(lc, currentchildren)
+ {
+ if (i < cur_num_partitioned_children)
+ {
+ Oid child_oid = lfirst_oid(lc);
+
+ rels_list = lappend_oid(rels_list, child_oid);
+ partitioned_rels_list = lappend_oid(partitioned_rels_list,
+ child_oid);
+ }
+ else
+ break;
+ i++;
+ }
+ first_nonpartitioned_child = lc;
/*
* Add to the queue only those children not already seen. This avoids
* making duplicate entries in case of multiple inheritance paths from
* the same parent. (It'll also keep us from getting into an infinite
* loop, though theoretically there can't be any cycles in the
- * inheritance graph anyway.)
+ * inheritance graph anyway.) Also, add them to the other_rels_list.
*/
- foreach(lc, currentchildren)
+ for_each_cell(lc, first_nonpartitioned_child)
{
Oid child_oid = lfirst_oid(lc);
bool found;
@@ -225,6 +319,7 @@ find_all_inheritors(Oid parentrelId, LOCKMODE lockmode, List **numparents)
{
/* if it's not there, add it. expect 1 parent, initially. */
rels_list = lappend_oid(rels_list, child_oid);
+ other_rels_list = lappend_oid(other_rels_list, child_oid);
rel_numparents = lappend_int(rel_numparents, 1);
hash_entry->numparents_cell = rel_numparents->tail;
}
@@ -237,8 +332,10 @@ find_all_inheritors(Oid parentrelId, LOCKMODE lockmode, List **numparents)
list_free(rel_numparents);
hash_destroy(seen_rels);
+ list_free(rels_list);
- return rels_list;
+ /* List partitioned child tables before non-partitioned ones. */
+ return list_concat(partitioned_rels_list, other_rels_list);
}
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2b638271b3..ae8ce71e1c 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1282,7 +1282,8 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
* the children.
*/
tableOIDs =
- find_all_inheritors(RelationGetRelid(onerel), AccessShareLock, NULL);
+ find_all_inheritors(RelationGetRelid(onerel), AccessShareLock, NULL,
+ NULL);
/*
* Check that there's at least one descendant, else fail. This could
diff --git a/src/backend/commands/lockcmds.c b/src/backend/commands/lockcmds.c
index 9fe9e022b0..529f244f7e 100644
--- a/src/backend/commands/lockcmds.c
+++ b/src/backend/commands/lockcmds.c
@@ -112,7 +112,7 @@ LockTableRecurse(Oid reloid, LOCKMODE lockmode, bool nowait)
List *children;
ListCell *lc;
- children = find_inheritance_children(reloid, NoLock);
+ children = find_inheritance_children(reloid, NoLock, NULL);
foreach(lc, children)
{
diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index 610cb499d2..64179ea3ef 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -516,7 +516,7 @@ OpenTableList(List *tables)
List *children;
children = find_all_inheritors(myrelid, ShareUpdateExclusiveLock,
- NULL);
+ NULL, NULL);
foreach(child, children)
{
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 1b8d4b3d17..14bac087d9 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1231,7 +1231,8 @@ ExecuteTruncate(TruncateStmt *stmt)
ListCell *child;
List *children;
- children = find_all_inheritors(myrelid, AccessExclusiveLock, NULL);
+ children = find_all_inheritors(myrelid, AccessExclusiveLock, NULL,
+ NULL);
foreach(child, children)
{
@@ -2556,7 +2557,7 @@ renameatt_internal(Oid myrelid,
* outside the inheritance hierarchy being processed.
*/
child_oids = find_all_inheritors(myrelid, AccessExclusiveLock,
- &child_numparents);
+ &child_numparents, NULL);
/*
* find_all_inheritors does the recursive search of the inheritance
@@ -2583,7 +2584,7 @@ renameatt_internal(Oid myrelid,
* expected_parents will only be 0 if we are not already recursing.
*/
if (expected_parents == 0 &&
- find_inheritance_children(myrelid, NoLock) != NIL)
+ find_inheritance_children(myrelid, NoLock, NULL) != NIL)
ereport(ERROR,
(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
errmsg("inherited column \"%s\" must be renamed in child tables too",
@@ -2766,7 +2767,7 @@ rename_constraint_internal(Oid myrelid,
*li;
child_oids = find_all_inheritors(myrelid, AccessExclusiveLock,
- &child_numparents);
+ &child_numparents, NULL);
forboth(lo, child_oids, li, child_numparents)
{
@@ -2782,7 +2783,7 @@ rename_constraint_internal(Oid myrelid,
else
{
if (expected_parents == 0 &&
- find_inheritance_children(myrelid, NoLock) != NIL)
+ find_inheritance_children(myrelid, NoLock, NULL) != NIL)
ereport(ERROR,
(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
errmsg("inherited constraint \"%s\" must be renamed in child tables too",
@@ -4790,7 +4791,7 @@ ATSimpleRecursion(List **wqueue, Relation rel,
ListCell *child;
List *children;
- children = find_all_inheritors(relid, lockmode, NULL);
+ children = find_all_inheritors(relid, lockmode, NULL, NULL);
/*
* find_all_inheritors does the recursive search of the inheritance
@@ -5186,7 +5187,7 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel,
*/
if (colDef->identity &&
recurse &&
- find_inheritance_children(myrelid, NoLock) != NIL)
+ find_inheritance_children(myrelid, NoLock, NULL) != NIL)
ereport(ERROR,
(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
errmsg("cannot recursively add identity column to table that has child tables")));
@@ -5392,7 +5393,8 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel,
* routines, we have to do this one level of recursion at a time; we can't
* use find_all_inheritors to do it in one pass.
*/
- children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+ children = find_inheritance_children(RelationGetRelid(rel), lockmode,
+ NULL);
/*
* If we are told not to recurse, there had better not be any child
@@ -6511,7 +6513,8 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
* routines, we have to do this one level of recursion at a time; we can't
* use find_all_inheritors to do it in one pass.
*/
- children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+ children = find_inheritance_children(RelationGetRelid(rel), lockmode,
+ NULL);
if (children)
{
@@ -6945,7 +6948,8 @@ ATAddCheckConstraint(List **wqueue, AlteredTableInfo *tab, Relation rel,
* routines, we have to do this one level of recursion at a time; we can't
* use find_all_inheritors to do it in one pass.
*/
- children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+ children = find_inheritance_children(RelationGetRelid(rel), lockmode,
+ NULL);
/*
* Check if ONLY was specified with ALTER TABLE. If so, allow the
@@ -7664,7 +7668,7 @@ ATExecValidateConstraint(Relation rel, char *constrName, bool recurse,
*/
if (!recursing && !con->connoinherit)
children = find_all_inheritors(RelationGetRelid(rel),
- lockmode, NULL);
+ lockmode, NULL, NULL);
/*
* For CHECK constraints, we must ensure that we only mark the
@@ -8547,7 +8551,8 @@ ATExecDropConstraint(Relation rel, const char *constrName,
* use find_all_inheritors to do it in one pass.
*/
if (!is_no_inherit_constraint)
- children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+ children = find_inheritance_children(RelationGetRelid(rel), lockmode,
+ NULL);
else
children = NIL;
@@ -8836,7 +8841,7 @@ ATPrepAlterColumnType(List **wqueue,
ListCell *child;
List *children;
- children = find_all_inheritors(relid, lockmode, NULL);
+ children = find_all_inheritors(relid, lockmode, NULL, NULL);
/*
* find_all_inheritors does the recursive search of the inheritance
@@ -8887,7 +8892,8 @@ ATPrepAlterColumnType(List **wqueue,
}
}
else if (!recursing &&
- find_inheritance_children(RelationGetRelid(rel), NoLock) != NIL)
+ find_inheritance_children(RelationGetRelid(rel),
+ NoLock, NULL) != NIL)
ereport(ERROR,
(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
errmsg("type of inherited column \"%s\" must be changed in child tables too",
@@ -10997,7 +11003,7 @@ ATExecAddInherit(Relation child_rel, RangeVar *parent, LOCKMODE lockmode)
* We use weakest lock we can on child's children, namely AccessShareLock.
*/
children = find_all_inheritors(RelationGetRelid(child_rel),
- AccessShareLock, NULL);
+ AccessShareLock, NULL, NULL);
if (list_member_oid(children, RelationGetRelid(parent_rel)))
ereport(ERROR,
@@ -13503,7 +13509,8 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
* weaker lock now and the stronger one only when needed.
*/
attachrel_children = find_all_inheritors(RelationGetRelid(attachrel),
- AccessExclusiveLock, NULL);
+ AccessExclusiveLock, NULL,
+ NULL);
if (list_member_oid(attachrel_children, RelationGetRelid(rel)))
ereport(ERROR,
(errcode(ERRCODE_DUPLICATE_TABLE),
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index faa181207a..e2e5ffce42 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -430,7 +430,8 @@ get_rel_oids(Oid relid, const RangeVar *vacrel)
oldcontext = MemoryContextSwitchTo(vac_context);
if (include_parts)
oid_list = list_concat(oid_list,
- find_all_inheritors(relid, NoLock, NULL));
+ find_all_inheritors(relid, NoLock, NULL,
+ NULL));
else
oid_list = lappend_oid(oid_list, relid);
MemoryContextSwitchTo(oldcontext);
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index cf46b74782..09e45c2982 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1418,7 +1418,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
lockmode = AccessShareLock;
/* Scan for all members of inheritance set, acquire needed locks */
- inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);
+ inhOIDs = find_all_inheritors(parentOID, lockmode, NULL, NULL);
/*
* Check that there's at least one descendant, else treat as no-child
diff --git a/src/include/catalog/pg_inherits_fn.h b/src/include/catalog/pg_inherits_fn.h
index 7743388899..8f371acae7 100644
--- a/src/include/catalog/pg_inherits_fn.h
+++ b/src/include/catalog/pg_inherits_fn.h
@@ -17,9 +17,10 @@
#include "nodes/pg_list.h"
#include "storage/lock.h"
-extern List *find_inheritance_children(Oid parentrelId, LOCKMODE lockmode);
+extern List *find_inheritance_children(Oid parentrelId, LOCKMODE lockmode,
+ int *num_partitioned_children);
extern List *find_all_inheritors(Oid parentrelId, LOCKMODE lockmode,
- List **parents);
+ List **parents, int *num_partitioned_children);
extern bool has_subclass(Oid relationId);
extern bool has_superclass(Oid relationId);
extern bool typeInheritsFrom(Oid subclassTypeId, Oid superclassTypeId);
--
2.11.0
0003-Relieve-RelationGetPartitionDispatchInfo-of-doing-an.patch (text/plain; charset=UTF-8)
From 6ae18ec3456b2a3fedd239059687873ae91ddbee Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Wed, 9 Aug 2017 15:34:45 +0900
Subject: [PATCH 3/5] Relieve RelationGetPartitionDispatchInfo() of doing any
locking
Anyone who wants to call RelationGetPartitionDispatchInfo() must first
acquire locks using find_all_inheritors.
---
src/backend/catalog/partition.c | 42 ++++++++++++++++++++---------------------
src/backend/executor/execMain.c | 20 +++++++++++---------
src/include/catalog/partition.h | 4 ++--
3 files changed, 33 insertions(+), 33 deletions(-)
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 465e4fc097..4c16bf143b 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1003,14 +1003,12 @@ get_partition_qual_relid(Oid relid)
* one member, that is, one for the root partitioned table), *leaf_part_oids
* contains a list of the OIDs of of all the leaf partitions.
*
- * Note that we lock only those partitions that are partitioned tables, because
- * we need to look at its relcache entry to get its PartitionKey and its
- * PartitionDesc. It's the caller's responsibility to lock the leaf partitions
- * that will actually be accessed during a given query.
+ * It is assumed that the caller has locked at least all the partitioned tables
+ * in the tree, because we need to look at their relcache entries.
*/
void
-RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
- List **ptinfos, List **leaf_part_oids)
+RelationGetPartitionDispatchInfo(Relation rel, List **ptinfos,
+ List **leaf_part_oids)
{
List *all_parts,
*all_parents;
@@ -1025,16 +1023,10 @@ RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
* Starting with the root partitioned table for which we already have the
* relcache entry, we look at its partition descriptor to get the
* partition OIDs. For partitions that are themselves partitioned tables,
- * we get their relcache entries after locking them with lockmode and
- * queue their partitions to be looked at later. Leaf partitions are
- * added to the result list without locking. For each partitioned table,
- * we build a PartitionedTableInfo object and add it to the other result
- * list.
- *
- * Since RelationBuildPartitionDescriptor() puts partitions in a canonical
- * order determined by comparing partition bounds, we can rely that
- * concurrent backends see the partitions in the same order, ensuring that
- * there are no deadlocks when locking the partitions.
+ * we get their relcache entries and queue their partitions to be looked at
+ * later. For each leaf partition, we simply add its OID to the result
+ * list and for each partitioned table, we build a PartitionedTableInfo
+ * object and add it to the other result list.
*/
i = offset = 0;
*ptinfos = *leaf_part_oids = NIL;
@@ -1057,8 +1049,14 @@ RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
PartitionedTableInfo *ptinfo;
PartitionDispatch pd;
+ /*
+ * All the relations in the partition tree must be locked
+ * by the caller.
+ *
+ * XXX - Add RelationLockHeldByMe(partrelid) check here!
+ */
if (partrelid != RelationGetRelid(rel))
- partrel = heap_open(partrelid, lockmode);
+ partrel = heap_open(partrelid, NoLock);
else
partrel = rel;
@@ -1077,7 +1075,8 @@ RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
/*
* XXX- do we need a pinning mechanism for partition descriptors
* so that there references can be managed independently of
- * the parent relcache entry? Like PinPartitionDesc(partdesc)?
+ * the fate of parent relcache entry?
+ * Like PinPartitionDesc(partdesc)?
*/
pd->partdesc = partdesc;
@@ -1141,10 +1140,9 @@ RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
/*
* Release the relation descriptor. Lock that we have on the
- * table will keep the PartitionDesc that is pointing into
- * RelationData intact, a pointer to which hope to keep
- * through this transaction's commit.
- * (XXX - how true is that?)
+ * table will keep PartitionDesc (that is pointing into
+ * RelationData) intact, a reference to which we want to keep through
+ * this transaction's commit. (XXX - how true is that?)
*/
if (partrel != rel)
heap_close(partrel, NoLock);
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 0379e489d9..3dd620fc8a 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -43,6 +43,7 @@
#include "access/xact.h"
#include "catalog/namespace.h"
#include "catalog/partition.h"
+#include "catalog/pg_inherits_fn.h"
#include "catalog/pg_publication.h"
#include "commands/matview.h"
#include "commands/trigger.h"
@@ -3250,14 +3251,16 @@ ExecSetupPartitionTupleRouting(Relation rel,
int i;
ResultRelInfo *leaf_part_rri;
Relation parent;
+ List *all_parts;
/*
- * Get information about the partition tree. All the partitioned
- * tables in the tree are locked, but not the leaf partitions. We
- * lock them while building their ResultRelInfos below.
+ * Get information about the partition tree. First lock all the
+ * partitions using find_all_inheritors().
*/
- RelationGetPartitionDispatchInfo(rel, RowExclusiveLock,
- &ptinfos, &leaf_parts);
+ all_parts = find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock,
+ NULL, NULL);
+ list_free(all_parts);
+ RelationGetPartitionDispatchInfo(rel, &ptinfos, &leaf_parts);
/*
* The ptinfos list contains PartitionedTableInfo objects for all the
@@ -3396,11 +3399,10 @@ ExecSetupPartitionTupleRouting(Relation rel,
TupleDesc part_tupdesc;
/*
- * RelationGetPartitionDispatchInfo didn't lock the leaf partitions,
- * so lock here. Note that each of the relations in *partitions are
- * eventually closed (when the plan is shut down, for instance).
+ * Note that each of the relations in *partitions are eventually
+ * closed (when the plan is shut down, for instance).
*/
- partrel = heap_open(lfirst_oid(cell), RowExclusiveLock);
+ partrel = heap_open(lfirst_oid(cell), NoLock); /* already locked */
part_tupdesc = RelationGetDescr(partrel);
/*
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 6a0c81b3bd..9e63020c82 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -72,8 +72,8 @@ extern List *map_partition_varattnos(List *expr, int target_varno,
extern List *RelationGetPartitionQual(Relation rel);
extern Expr *get_partition_qual_relid(Oid relid);
-extern void RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
- List **ptinfos, List **leaf_part_oids);
+extern void RelationGetPartitionDispatchInfo(Relation rel, List **ptinfos,
+ List **leaf_part_oids);
/* For tuple routing */
extern void FormPartitionKeyDatum(PartitionKeyInfo *keyinfo,
--
2.11.0
0004-Teach-expand_inherited_rtentry-to-use-partition-boun.patch (text/plain; charset=UTF-8)
From f09d3f00861b47fcb36e20f43be0b718e3350ab5 Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Wed, 9 Aug 2017 15:52:36 +0900
Subject: [PATCH 4/5] Teach expand_inherited_rtentry to use partition bound
order
After locking the child tables using find_all_inheritors, we discard
the list of child table OIDs that it generates and rebuild the same
using the information returned by RelationGetPartitionDispatchInfo.
---
src/backend/optimizer/prep/prepunion.c | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 09e45c2982..71a0daa1b0 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -33,6 +33,7 @@
#include "access/heapam.h"
#include "access/htup_details.h"
#include "access/sysattr.h"
+#include "catalog/partition.h"
#include "catalog/pg_inherits_fn.h"
#include "catalog/pg_type.h"
#include "miscadmin.h"
@@ -1446,6 +1447,37 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
*/
oldrelation = heap_open(parentOID, NoLock);
+ /*
+ * For partitioned tables, we arrange the child table OIDs such that they
+ * appear in the partition bound order.
+ */
+ if (rte->relkind == RELKIND_PARTITIONED_TABLE)
+ {
+ List *ptinfos,
+ *leaf_part_oids;
+
+ /* Discard the original list. */
+ list_free(inhOIDs);
+ inhOIDs = NIL;
+
+ /* Request partitioning information. */
+ RelationGetPartitionDispatchInfo(oldrelation, &ptinfos,
+ &leaf_part_oids);
+ /*
+ * First collect the partitioned child table OIDs, which includes the
+ * root parent at the head.
+ */
+ foreach(l, ptinfos)
+ {
+ PartitionedTableInfo *ptinfo = lfirst(l);
+
+ inhOIDs = lappend_oid(inhOIDs, ptinfo->relid);
+ }
+
+ /* Concatenate the leaf partition OIDs. */
+ inhOIDs = list_concat(inhOIDs, leaf_part_oids);
+ }
+
/* Scan the inheritance set and expand it */
appinfos = NIL;
need_append = false;
--
2.11.0
0005-Store-in-pg_inherits-if-a-child-is-a-partitioned-tab.patch (text/plain; charset=UTF-8)
From 704b0877170757deae269b6bababbb2487693a4b Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Wed, 9 Aug 2017 16:53:47 +0900
Subject: [PATCH 5/5] Store in pg_inherits if a child is a partitioned table
---
doc/src/sgml/catalogs.sgml | 10 ++++++++++
src/backend/catalog/pg_inherits.c | 14 +++++++-------
src/backend/commands/tablecmds.c | 17 +++++++++++------
src/include/catalog/pg_inherits.h | 4 +++-
4 files changed, 31 insertions(+), 14 deletions(-)
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 97e5ecf686..eae9b77ccb 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -3896,6 +3896,16 @@ SCRAM-SHA-256$<replaceable><iteration count></>:<replaceable><salt><
inherited columns are to be arranged. The count starts at 1.
</entry>
</row>
+
+ <row>
+ <entry><structfield>inhchildparted</structfield></entry>
+ <entry><type>bool</type></entry>
+ <entry></entry>
+ <entry>
+ This is <literal>true</> if the child table is a partitioned table,
+ <literal>false</> otherwise
+ </entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/catalog/pg_inherits.c b/src/backend/catalog/pg_inherits.c
index 99b1e70de6..0285bc3c33 100644
--- a/src/backend/catalog/pg_inherits.c
+++ b/src/backend/catalog/pg_inherits.c
@@ -30,7 +30,6 @@
#include "utils/builtins.h"
#include "utils/fmgroids.h"
#include "utils/memutils.h"
-#include "utils/lsyscache.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
@@ -123,7 +122,12 @@ find_inheritance_children(Oid parentrelId, LOCKMODE lockmode,
while ((inheritsTuple = systable_getnext(scan)) != NULL)
{
+ bool is_partitioned;
+
inhrelid = ((Form_pg_inherits) GETSTRUCT(inheritsTuple))->inhrelid;
+ is_partitioned = ((Form_pg_inherits)
+ GETSTRUCT(inheritsTuple))->inhchildparted;
+
if (numchildren >= maxchildren)
{
maxchildren *= 2;
@@ -131,14 +135,10 @@ find_inheritance_children(Oid parentrelId, LOCKMODE lockmode,
maxchildren * sizeof(InhChildInfo));
}
inhchildren[numchildren].relid = inhrelid;
+ inhchildren[numchildren].is_partitioned = is_partitioned;
- if (get_rel_relkind(inhrelid) == RELKIND_PARTITIONED_TABLE)
- {
- inhchildren[numchildren].is_partitioned = true;
+ if (is_partitioned)
my_num_partitioned_children++;
- }
- else
- inhchildren[numchildren].is_partitioned = false;
numchildren++;
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 14bac087d9..ab3cbbcdba 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -299,10 +299,10 @@ static bool MergeCheckConstraint(List *constraints, char *name, Node *expr);
static void MergeAttributesIntoExisting(Relation child_rel, Relation parent_rel);
static void MergeConstraintsIntoExisting(Relation child_rel, Relation parent_rel);
static void StoreCatalogInheritance(Oid relationId, List *supers,
- bool child_is_partition);
+ bool child_is_partition, bool child_is_partitioned);
static void StoreCatalogInheritance1(Oid relationId, Oid parentOid,
int16 seqNumber, Relation inhRelation,
- bool child_is_partition);
+ bool child_is_partition, bool child_is_partitioned);
static int findAttrByName(const char *attributeName, List *schema);
static void AlterIndexNamespaces(Relation classRel, Relation rel,
Oid oldNspOid, Oid newNspOid, ObjectAddresses *objsMoved);
@@ -746,7 +746,8 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
typaddress);
/* Store inheritance information for new rel. */
- StoreCatalogInheritance(relationId, inheritOids, stmt->partbound != NULL);
+ StoreCatalogInheritance(relationId, inheritOids, stmt->partbound != NULL,
+ relkind == RELKIND_PARTITIONED_TABLE);
/*
* We must bump the command counter to make the newly-created relation
@@ -2298,7 +2299,7 @@ MergeCheckConstraint(List *constraints, char *name, Node *expr)
*/
static void
StoreCatalogInheritance(Oid relationId, List *supers,
- bool child_is_partition)
+ bool child_is_partition, bool child_is_partitioned)
{
Relation relation;
int16 seqNumber;
@@ -2329,7 +2330,7 @@ StoreCatalogInheritance(Oid relationId, List *supers,
Oid parentOid = lfirst_oid(entry);
StoreCatalogInheritance1(relationId, parentOid, seqNumber, relation,
- child_is_partition);
+ child_is_partition, child_is_partitioned);
seqNumber++;
}
@@ -2343,7 +2344,7 @@ StoreCatalogInheritance(Oid relationId, List *supers,
static void
StoreCatalogInheritance1(Oid relationId, Oid parentOid,
int16 seqNumber, Relation inhRelation,
- bool child_is_partition)
+ bool child_is_partition, bool child_is_partitioned)
{
TupleDesc desc = RelationGetDescr(inhRelation);
Datum values[Natts_pg_inherits];
@@ -2358,6 +2359,8 @@ StoreCatalogInheritance1(Oid relationId, Oid parentOid,
values[Anum_pg_inherits_inhrelid - 1] = ObjectIdGetDatum(relationId);
values[Anum_pg_inherits_inhparent - 1] = ObjectIdGetDatum(parentOid);
values[Anum_pg_inherits_inhseqno - 1] = Int16GetDatum(seqNumber);
+ values[Anum_pg_inherits_inhchildparted - 1] =
+ BoolGetDatum(child_is_partitioned);
memset(nulls, 0, sizeof(nulls));
@@ -11112,6 +11115,8 @@ CreateInheritance(Relation child_rel, Relation parent_rel)
inhseqno + 1,
catalogRelation,
parent_rel->rd_rel->relkind ==
+ RELKIND_PARTITIONED_TABLE,
+ child_rel->rd_rel->relkind ==
RELKIND_PARTITIONED_TABLE);
/* Now we're done with pg_inherits */
diff --git a/src/include/catalog/pg_inherits.h b/src/include/catalog/pg_inherits.h
index 26bfab5db6..2c4ef246a4 100644
--- a/src/include/catalog/pg_inherits.h
+++ b/src/include/catalog/pg_inherits.h
@@ -33,6 +33,7 @@ CATALOG(pg_inherits,2611) BKI_WITHOUT_OIDS
Oid inhrelid;
Oid inhparent;
int32 inhseqno;
+ bool inhchildparted;
} FormData_pg_inherits;
/* ----------------
@@ -46,10 +47,11 @@ typedef FormData_pg_inherits *Form_pg_inherits;
* compiler constants for pg_inherits
* ----------------
*/
-#define Natts_pg_inherits 3
+#define Natts_pg_inherits 4
#define Anum_pg_inherits_inhrelid 1
#define Anum_pg_inherits_inhparent 2
#define Anum_pg_inherits_inhseqno 3
+#define Anum_pg_inherits_inhchildparted 4
/* ----------------
* pg_inherits has no initial contents
--
2.11.0
Hi Amit,
On Thu, Aug 10, 2017 at 7:41 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2017/08/05 2:25, Robert Haas wrote:
Concretely, my proposal is:
P.S. While I haven't reviewed 0002 in detail, I think the concept of
minimizing what needs to be built in RelationGetPartitionDispatchInfo
is a very good idea.
I put this patch ahead in the list and so it's now 0001.
FYI, 0001 patch throws the warning:
execMain.c: In function ‘ExecSetupPartitionTupleRouting’:
execMain.c:3342:16: warning: ‘next_ptinfo’ may be used uninitialized
in this function [-Wmaybe-uninitialized]
next_ptinfo->parentid != ptinfo->parentid)
--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Aug 9, 2017 at 10:11 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
P.S. While I haven't reviewed 0002 in detail, I think the concept of
minimizing what needs to be built in RelationGetPartitionDispatchInfo
is a very good idea.
I put this patch ahead in the list and so it's now 0001.
I think what you've currently got as
0003-Relieve-RelationGetPartitionDispatchInfo-of-doing-an.patch is a
bug fix that probably needs to be back-patched into v10, so it should
come first.
I think 0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch and
0005-Store-in-pg_inherits-if-a-child-is-a-partitioned-tab.patch should
be merged into one patch and that should come next, followed by
0004-Teach-expand_inherited_rtentry-to-use-partition-boun.patch and
finally what you now have as
0001-Decouple-RelationGetPartitionDispatchInfo-from-execu.patch.
This patch series is blocking a bunch of other things, so it would be
nice if you could press forward with this quickly.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
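A note on why the locking patch counts as a bug fix: its description says callers must take their locks through find_all_inheritors, and the start of this thread notes that this function returns child OIDs in ascending OID order, so every backend locks the children in the same sequence. The following is only a minimal sketch of that reasoning, not PostgreSQL code. The Oid typedef and the functions sort_lock_order and lock_orders_conflict are hypothetical names made up for illustration. It models two lock sequences and reports whether some pair of relations is taken in opposite orders, which is the situation that can deadlock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef unsigned int Oid;	/* stand-in for PostgreSQL's Oid type */

/* qsort comparator: ascending OID order */
static int
oid_cmp(const void *a, const void *b)
{
	Oid			x = *(const Oid *) a;
	Oid			y = *(const Oid *) b;

	return (x > y) - (x < y);
}

/* Hypothetical helper: put a lock sequence into ascending OID order. */
void
sort_lock_order(Oid *oids, size_t n)
{
	assert(oids != NULL || n == 0);
	if (n > 1)
		qsort(oids, n, sizeof(Oid), oid_cmp);
}

/* Position of oid within seq, or -1 if it does not appear there. */
static long
position_of(Oid oid, const Oid *seq, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (seq[i] == oid)
			return (long) i;
	return -1;
}

/*
 * Hypothetical helper: true if some pair of relations is locked in one
 * order by sequence a and in the opposite order by sequence b.
 */
bool
lock_orders_conflict(const Oid *a, size_t na, const Oid *b, size_t nb)
{
	for (size_t i = 0; i < na; i++)
		for (size_t j = i + 1; j < na; j++)
		{
			long		pi = position_of(a[i], b, nb);
			long		pj = position_of(a[j], b, nb);

			if (pi >= 0 && pj >= 0 && pi > pj)
				return true;
		}
	return false;
}
```

In this model, a sequence in partition bound order such as {30, 10, 20} conflicts with the OID-ordered sequence {10, 20, 30}, while two sequences that both went through sort_lock_order never conflict.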
On 2017/08/10 18:52, Beena Emerson wrote:
Hi Amit,
On Thu, Aug 10, 2017 at 7:41 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2017/08/05 2:25, Robert Haas wrote:
Concretely, my proposal is:
P.S. While I haven't reviewed 0002 in detail, I think the concept of
minimizing what needs to be built in RelationGetPartitionDispatchInfo
is a very good idea.
I put this patch ahead in the list and so it's now 0001.
FYI, 0001 patch throws the warning:
execMain.c: In function ‘ExecSetupPartitionTupleRouting’:
execMain.c:3342:16: warning: ‘next_ptinfo’ may be used uninitialized
in this function [-Wmaybe-uninitialized]
next_ptinfo->parentid != ptinfo->parentid)
Thanks for the review. Will fix in the updated version of the patch I
will post sometime later today.
Regards,
Amit
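For readers unfamiliar with this class of warning: GCC emits -Wmaybe-uninitialized when it cannot prove that a variable is assigned on every path that reads it. The sketch below is not the patch's code. DemoInfo and count_parent_groups are hypothetical names that only mimic the shape of the flagged comparison (next_ptinfo->parentid != ptinfo->parentid). One common way to address such a warning, assumed here, is to initialize the look-ahead pointer to NULL and test it before dereferencing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the patch's per-table info; only parentid matters. */
typedef struct DemoInfo
{
	unsigned int parentid;
} DemoInfo;

/*
 * Count runs of consecutive entries that share a parentid.  The look-ahead
 * pointer is set to NULL on every iteration, so no path can read an
 * indeterminate value.
 */
int
count_parent_groups(const DemoInfo *infos, int n)
{
	int			groups = 0;

	assert(infos != NULL || n == 0);
	for (int i = 0; i < n; i++)
	{
		const DemoInfo *next = NULL;	/* explicit initialization */

		if (i + 1 < n)
			next = &infos[i + 1];

		/* A group ends at the last entry or where the parent changes. */
		if (next == NULL || next->parentid != infos[i].parentid)
			groups++;
	}
	return groups;
}
```

Whether the real fix initializes the variable or restructures the loop is for the updated patch to decide; the sketch only shows why the explicit NULL makes every read well defined.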
Thanks for the review.
On 2017/08/16 2:27, Robert Haas wrote:
On Wed, Aug 9, 2017 at 10:11 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
P.S. While I haven't reviewed 0002 in detail, I think the concept of
minimizing what needs to be built in RelationGetPartitionDispatchInfo
is a very good idea.
I put this patch ahead in the list and so it's now 0001.
I think what you've currently got as
0003-Relieve-RelationGetPartitionDispatchInfo-of-doing-an.patch is a
bug fix that probably needs to be back-patched into v10, so it should
come first.
That makes sense. That patch is now 0001. Checked that it can be
back-patched to REL_10_STABLE.
I think 0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch and
0005-Store-in-pg_inherits-if-a-child-is-a-partitioned-tab.patch should
be merged into one patch and that should come next,
Merged the two into one: attached 0002.
followed by
0004-Teach-expand_inherited_rtentry-to-use-partition-boun.patch and
This one is now 0003.
finally what you now have as
0001-Decouple-RelationGetPartitionDispatchInfo-from-execu.patch.
And 0004.
This patch series is blocking a bunch of other things, so it would be
nice if you could press forward with this quickly.
Attached updated patches.
Thanks,
Amit
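To make the target ordering of the expand_inherited_rtentry patch concrete: the patch first collects the partitioned tables (root parent at the head), then concatenates the leaf partition OIDs, with each table's partitions visited in bound order, top-down and breadth-first. The following is a rough model of that traversal under stated assumptions, not the patch itself. DemoRel, MAX_RELS and expand_in_bound_order are hypothetical, and the parts array is assumed to be already sorted by partition bound.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int Oid;	/* stand-in for PostgreSQL's Oid type */

#define MAX_RELS 32			/* arbitrary capacity for this sketch */

/* Hypothetical model of one relation in a partition tree. */
typedef struct DemoRel
{
	Oid			relid;
	bool		is_partitioned;
	int			nparts;		/* number of entries in parts */
	const struct DemoRel **parts;	/* children, assumed in bound order */
} DemoRel;

/*
 * Fill out[] with the partitioned tables in breadth-first order (root
 * first), followed by all leaf partitions.  Returns the number of OIDs
 * written.
 */
size_t
expand_in_bound_order(const DemoRel *root, Oid *out, size_t outcap)
{
	const DemoRel *queue[MAX_RELS];
	Oid			leaves[MAX_RELS];
	size_t		qhead = 0,
				qtail = 0,
				nleaves = 0,
				nout = 0;

	queue[qtail++] = root;
	while (qhead < qtail)
	{
		const DemoRel *rel = queue[qhead++];

		assert(nout < outcap);
		out[nout++] = rel->relid;

		for (int i = 0; i < rel->nparts; i++)
		{
			const DemoRel *child = rel->parts[i];

			if (child->is_partitioned)
			{
				/* Queue it so that its own partitions are visited later. */
				assert(qtail < MAX_RELS);
				queue[qtail++] = child;
			}
			else
			{
				assert(nleaves < MAX_RELS);
				leaves[nleaves++] = child->relid;
			}
		}
	}

	/* Leaf partitions come after all partitioned tables. */
	for (size_t i = 0; i < nleaves; i++)
	{
		assert(nout < outcap);
		out[nout++] = leaves[i];
	}
	return nout;
}
```

Note that the resulting list is in general not sorted by OID, which is why the series locks the tables through find_all_inheritors first and only then rebuilds the list in this order.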
Attachments:
0001-Relieve-RelationGetPartitionDispatchInfo-of-doing-an.patch (text/plain; charset=UTF-8)
From 23a3e291001394ffa2b79b34b32c582cb4898e87 Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Wed, 16 Aug 2017 11:36:14 +0900
Subject: [PATCH 1/4] Relieve RelationGetPartitionDispatchInfo() of doing any
locking
Anyone who wants to call RelationGetPartitionDispatchInfo() must first
acquire locks using find_all_inheritors.
Doing it this way gets rid of the possibility of a deadlock when partitions
are concurrently locked, because RelationGetPartitionDispatchInfo would lock
the partitions in one order and find_all_inheritors would in another.
Reported-by: Amit Khandekar, Robert Haas
Reports: https://postgr.es/m/CAJ3gD9fdjk2O8aPMXidCeYeB-mFB%3DwY9ZLfe8cQOfG4bTqVGyQ%40mail.gmail.com
https://postgr.es/m/CA%2BTgmobwbh12OJerqAGyPEjb_%2B2y7T0nqRKTcjed6L4NTET6Fg%40mail.gmail.com
---
src/backend/catalog/partition.c | 55 ++++++++++++++++++++++-------------------
src/backend/executor/execMain.c | 18 +++++++++-----
src/include/catalog/partition.h | 3 +--
3 files changed, 42 insertions(+), 34 deletions(-)
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index c1a307c8d3..96a64ce6b2 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -999,12 +999,16 @@ get_partition_qual_relid(Oid relid)
* RelationGetPartitionDispatchInfo
* Returns information necessary to route tuples down a partition tree
*
- * All the partitions will be locked with lockmode, unless it is NoLock.
- * A list of the OIDs of all the leaf partitions of rel is returned in
- * *leaf_part_oids.
+ * The number of elements in the returned array (that is, the number of
+ * PartitionDispatch objects for the partitioned tables in the partition tree)
+ * is returned in *num_parted and a list of the OIDs of all the leaf
+ * partitions of rel is returned in *leaf_part_oids.
+ *
+ * All the relations in the partition tree (including 'rel') must have been
+ * locked (using at least the AccessShareLock) by the caller.
*/
PartitionDispatch *
-RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
+RelationGetPartitionDispatchInfo(Relation rel,
int *num_parted, List **leaf_part_oids)
{
PartitionDispatchData **pd;
@@ -1019,14 +1023,18 @@ RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
offset;
/*
- * Lock partitions and make a list of the partitioned ones to prepare
- * their PartitionDispatch objects below.
+ * We rely on the relcache to traverse the partition tree to build both
+ * the leaf partition OIDs list and the array of PartitionDispatch objects
+ * for the partitioned tables in the tree. That means every partitioned
+ * table in the tree must be locked, which is fine since we require the
+ * caller to lock all the partitions anyway.
*
- * Cannot use find_all_inheritors() here, because then the order of OIDs
- * in parted_rels list would be unknown, which does not help, because we
- * assign indexes within individual PartitionDispatch in an order that is
- * predetermined (determined by the order of OIDs in individual partition
- * descriptors).
+ * For every partitioned table in the tree, starting with the root
+ * partitioned table, add its relcache entry to parted_rels, while also
+ * queuing its partitions (in the order in which they appear in the
+ * partition descriptor) to be looked at later in the same loop. This is
+ * a bit tricky but works because the foreach() macro doesn't fetch the
+ * next list element until the bottom of the loop.
*/
*num_parted = 1;
parted_rels = list_make1(rel);
@@ -1035,29 +1043,24 @@ RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
APPEND_REL_PARTITION_OIDS(rel, all_parts, all_parents);
forboth(lc1, all_parts, lc2, all_parents)
{
- Relation partrel = heap_open(lfirst_oid(lc1), lockmode);
+ Oid partrelid = lfirst_oid(lc1);
Relation parent = lfirst(lc2);
- PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
- /*
- * If this partition is a partitioned table, add its children to the
- * end of the list, so that they are processed as well.
- */
- if (partdesc)
+ if (get_rel_relkind(partrelid) == RELKIND_PARTITIONED_TABLE)
{
+ /*
+ * Already locked by the caller. Note that it is the
+ * responsibility of the caller to close the below relcache entry,
+ * once done using the information being collected here (for
+ * example, in ExecEndModifyTable).
+ */
+ Relation partrel = heap_open(partrelid, NoLock);
+
(*num_parted)++;
parted_rels = lappend(parted_rels, partrel);
parted_rel_parents = lappend(parted_rel_parents, parent);
APPEND_REL_PARTITION_OIDS(partrel, all_parts, all_parents);
}
- else
- heap_close(partrel, NoLock);
-
- /*
- * We keep the partitioned ones open until we're done using the
- * information being collected here (for example, see
- * ExecEndModifyTable).
- */
}
/*
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 6671a25ffb..eeadd8bec5 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -43,6 +43,7 @@
#include "access/xact.h"
#include "catalog/namespace.h"
#include "catalog/partition.h"
+#include "catalog/pg_inherits_fn.h"
#include "catalog/pg_publication.h"
#include "commands/matview.h"
#include "commands/trigger.h"
@@ -3248,10 +3249,16 @@ ExecSetupPartitionTupleRouting(Relation rel,
ListCell *cell;
int i;
ResultRelInfo *leaf_part_rri;
+ List *all_parts;
- /* Get the tuple-routing information and lock partitions */
- *pd = RelationGetPartitionDispatchInfo(rel, RowExclusiveLock, num_parted,
- &leaf_parts);
+ /*
+ * Get the information about the partition tree after locking all the
+ * partitions.
+ */
+ all_parts = find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock,
+ NULL);
+ list_free(all_parts);
+ *pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
*num_partitions = list_length(leaf_parts);
*partitions = (ResultRelInfo *) palloc(*num_partitions *
sizeof(ResultRelInfo));
@@ -3274,9 +3281,8 @@ ExecSetupPartitionTupleRouting(Relation rel,
TupleDesc part_tupdesc;
/*
- * We locked all the partitions above including the leaf partitions.
- * Note that each of the relations in *partitions are eventually
- * closed by the caller.
+ * All the partitions were locked above. Note that the relcache
+ * entries will be closed by ExecEndModifyTable().
*/
partrel = heap_open(lfirst_oid(cell), NoLock);
part_tupdesc = RelationGetDescr(partrel);
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index bef7a0f5fb..2283c675e9 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -88,8 +88,7 @@ extern Expr *get_partition_qual_relid(Oid relid);
/* For tuple routing */
extern PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
- int lockmode, int *num_parted,
- List **leaf_part_oids);
+ int *num_parted, List **leaf_part_oids);
extern void FormPartitionKeyDatum(PartitionDispatch pd,
TupleTableSlot *slot,
EState *estate,
--
2.11.0
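To illustrate the traversal described in the comment above (the worklist grows while the loop walks it), here is a minimal standalone sketch. It is not the patch's code: ToyRel, walk_partition_tree and MAX_RELS are hypothetical names, a plain array stands in for the List, and a table with no partitions is simply treated as a leaf.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RELS 64				/* hypothetical fixed capacity for the sketch */

/* Toy stand-in for a relation: an id plus children in bound order. */
typedef struct ToyRel
{
	int			id;
	int			nparts;			/* 0 means leaf partition in this sketch */
	const struct ToyRel **parts;	/* children, in partition bound order */
} ToyRel;

/*
 * Walk the tree breadth-first.  Partitioned tables are recorded in the
 * order they are visited (root first); leaves are recorded in the order
 * they are encountered.  Output arrays must hold MAX_RELS entries.
 */
static void
walk_partition_tree(const ToyRel *root,
					int *parted, int *nparted,
					int *leaves, int *nleaves)
{
	const ToyRel *queue[MAX_RELS];
	int			qlen = 0;
	int			i,
				j;

	*nparted = 0;
	*nleaves = 0;
	queue[qlen++] = root;

	/* qlen may grow inside the loop; the condition re-reads it each time. */
	for (i = 0; i < qlen; i++)
	{
		const ToyRel *rel = queue[i];

		parted[(*nparted)++] = rel->id;
		for (j = 0; j < rel->nparts; j++)
		{
			const ToyRel *child = rel->parts[j];

			if (child->nparts > 0)
			{
				/* Partitioned child: queue it to be looked at later. */
				assert(qlen < MAX_RELS);
				queue[qlen++] = child;
			}
			else
			{
				assert(*nleaves < MAX_RELS);
				leaves[(*nleaves)++] = child->id;
			}
		}
	}
}
```

Calling walk_partition_tree() on a root whose middle partition is itself partitioned yields the root and that sub-partitioned table in the first array, and the leaves of each partitioned table, grouped per parent, in the second.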
0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch
From e0ffad29a97f8ab2c2ee9bff1a4c1c6168c08532 Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Tue, 8 Aug 2017 18:42:30 +0900
Subject: [PATCH 2/4] Teach pg_inherits.c a bit about partitioning
Both find_inheritance_children and find_all_inheritors now list
partitioned child tables before non-partitioned ones and return
the number of partitioned tables in an optional output argument.
We also now store in pg_inherits, when adding a new child, whether
the child is a partitioned table.
Per design idea from Robert Haas
---
contrib/sepgsql/dml.c | 2 +-
doc/src/sgml/catalogs.sgml | 10 +++
src/backend/catalog/partition.c | 2 +-
src/backend/catalog/pg_inherits.c | 157 ++++++++++++++++++++++++++-------
src/backend/commands/analyze.c | 3 +-
src/backend/commands/lockcmds.c | 2 +-
src/backend/commands/publicationcmds.c | 2 +-
src/backend/commands/tablecmds.c | 56 +++++++-----
src/backend/commands/vacuum.c | 3 +-
src/backend/executor/execMain.c | 2 +-
src/backend/optimizer/prep/prepunion.c | 2 +-
src/include/catalog/pg_inherits.h | 4 +-
src/include/catalog/pg_inherits_fn.h | 5 +-
13 files changed, 187 insertions(+), 63 deletions(-)
diff --git a/contrib/sepgsql/dml.c b/contrib/sepgsql/dml.c
index b643720e36..6fc279805c 100644
--- a/contrib/sepgsql/dml.c
+++ b/contrib/sepgsql/dml.c
@@ -333,7 +333,7 @@ sepgsql_dml_privileges(List *rangeTabls, bool abort_on_violation)
if (!rte->inh)
tableIds = list_make1_oid(rte->relid);
else
- tableIds = find_all_inheritors(rte->relid, NoLock, NULL);
+ tableIds = find_all_inheritors(rte->relid, NoLock, NULL, NULL);
foreach(li, tableIds)
{
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index ef7054cf26..c1d5a75020 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -3894,6 +3894,16 @@ SCRAM-SHA-256$<replaceable><iteration count></>:<replaceable><salt><
inherited columns are to be arranged. The count starts at 1.
</entry>
</row>
+
+ <row>
+ <entry><structfield>inhchildparted</structfield></entry>
+ <entry><type>bool</type></entry>
+ <entry></entry>
+ <entry>
+ This is <literal>true</> if the child table is a partitioned table,
+ <literal>false</> otherwise.
+ </entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 96a64ce6b2..efc025ec42 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -178,7 +178,7 @@ RelationBuildPartitionDesc(Relation rel)
return;
/* Get partition oids from pg_inherits */
- inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
+ inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock, NULL);
/* Collect bound spec nodes in a list */
i = 0;
diff --git a/src/backend/catalog/pg_inherits.c b/src/backend/catalog/pg_inherits.c
index 245a374fc9..0285bc3c33 100644
--- a/src/backend/catalog/pg_inherits.c
+++ b/src/backend/catalog/pg_inherits.c
@@ -33,6 +33,8 @@
#include "utils/syscache.h"
#include "utils/tqual.h"
+static int32 inhchildinfo_cmp(const void *p1, const void *p2);
+
/*
* Entry of a hash table used in find_all_inheritors. See below.
*/
@@ -42,6 +44,30 @@ typedef struct SeenRelsEntry
ListCell *numparents_cell; /* corresponding list cell */
} SeenRelsEntry;
+/* Information about one inheritance child table. */
+typedef struct InhChildInfo
+{
+ Oid relid;
+ bool is_partitioned;
+} InhChildInfo;
+
+#define OID_CMP(o1, o2) \
+ ((o1) < (o2) ? -1 : ((o1) > (o2) ? 1 : 0))
+
+static int32
+inhchildinfo_cmp(const void *p1, const void *p2)
+{
+ InhChildInfo c1 = *((const InhChildInfo *) p1);
+ InhChildInfo c2 = *((const InhChildInfo *) p2);
+
+ if (c1.is_partitioned && !c2.is_partitioned)
+ return -1;
+ if (!c1.is_partitioned && c2.is_partitioned)
+ return 1;
+
+ return OID_CMP(c1.relid, c2.relid);
+}
+
/*
* find_inheritance_children
*
@@ -54,7 +80,8 @@ typedef struct SeenRelsEntry
* against possible DROPs of child relations.
*/
List *
-find_inheritance_children(Oid parentrelId, LOCKMODE lockmode)
+find_inheritance_children(Oid parentrelId, LOCKMODE lockmode,
+ int *num_partitioned_children)
{
List *list = NIL;
Relation relation;
@@ -62,9 +89,10 @@ find_inheritance_children(Oid parentrelId, LOCKMODE lockmode)
ScanKeyData key[1];
HeapTuple inheritsTuple;
Oid inhrelid;
- Oid *oidarr;
- int maxoids,
- numoids,
+ InhChildInfo *inhchildren;
+ int maxchildren,
+ numchildren,
+ my_num_partitioned_children,
i;
/*
@@ -77,9 +105,10 @@ find_inheritance_children(Oid parentrelId, LOCKMODE lockmode)
/*
* Scan pg_inherits and build a working array of subclass OIDs.
*/
- maxoids = 32;
- oidarr = (Oid *) palloc(maxoids * sizeof(Oid));
- numoids = 0;
+ maxchildren = 32;
+ inhchildren = (InhChildInfo *) palloc(maxchildren * sizeof(InhChildInfo));
+ numchildren = 0;
+ my_num_partitioned_children = 0;
relation = heap_open(InheritsRelationId, AccessShareLock);
@@ -93,34 +122,49 @@ find_inheritance_children(Oid parentrelId, LOCKMODE lockmode)
while ((inheritsTuple = systable_getnext(scan)) != NULL)
{
+ bool is_partitioned;
+
inhrelid = ((Form_pg_inherits) GETSTRUCT(inheritsTuple))->inhrelid;
- if (numoids >= maxoids)
+ is_partitioned = ((Form_pg_inherits)
+ GETSTRUCT(inheritsTuple))->inhchildparted;
+
+ if (numchildren >= maxchildren)
{
- maxoids *= 2;
- oidarr = (Oid *) repalloc(oidarr, maxoids * sizeof(Oid));
+ maxchildren *= 2;
+ inhchildren = (InhChildInfo *) repalloc(inhchildren,
+ maxchildren * sizeof(InhChildInfo));
}
- oidarr[numoids++] = inhrelid;
+ inhchildren[numchildren].relid = inhrelid;
+ inhchildren[numchildren].is_partitioned = is_partitioned;
+
+ if (is_partitioned)
+ my_num_partitioned_children++;
+ numchildren++;
}
systable_endscan(scan);
heap_close(relation, AccessShareLock);
+ if (num_partitioned_children)
+ *num_partitioned_children = my_num_partitioned_children;
+
/*
* If we found more than one child, sort them by OID. This ensures
* reasonably consistent behavior regardless of the vagaries of an
* indexscan. This is important since we need to be sure all backends
* lock children in the same order to avoid needless deadlocks.
*/
- if (numoids > 1)
- qsort(oidarr, numoids, sizeof(Oid), oid_cmp);
+ if (numchildren > 1)
+ qsort(inhchildren, numchildren, sizeof(InhChildInfo),
+ inhchildinfo_cmp);
/*
* Acquire locks and build the result list.
*/
- for (i = 0; i < numoids; i++)
+ for (i = 0; i < numchildren; i++)
{
- inhrelid = oidarr[i];
+ inhrelid = inhchildren[i].relid;
if (lockmode != NoLock)
{
@@ -144,7 +188,7 @@ find_inheritance_children(Oid parentrelId, LOCKMODE lockmode)
list = lappend_oid(list, inhrelid);
}
- pfree(oidarr);
+ pfree(inhchildren);
return list;
}
@@ -159,18 +203,28 @@ find_inheritance_children(Oid parentrelId, LOCKMODE lockmode)
* given rel.
*
* The specified lock type is acquired on all child relations (but not on the
- * given rel; caller should already have locked it). If lockmode is NoLock
- * then no locks are acquired, but caller must beware of race conditions
- * against possible DROPs of child relations.
+ * given rel; caller should already have locked it), unless
+ * lock_only_partitioned_children is specified, in which case, only the
+ * child relations that are partitioned tables are locked. If lockmode is
+ * NoLock then no locks are acquired, but caller must beware of race
+ * conditions against possible DROPs of child relations.
+ *
+ * Returned list of OIDs is such that all the partitioned tables in the tree
+ * appear at the head of the list. If num_partitioned_children is non-NULL,
+ * *num_partitioned_children returns the number of partitioned child table
+ * OIDs at the head of the list.
*/
List *
-find_all_inheritors(Oid parentrelId, LOCKMODE lockmode, List **numparents)
+find_all_inheritors(Oid parentrelId, LOCKMODE lockmode,
+ List **numparents, int *num_partitioned_children)
{
/* hash table for O(1) rel_oid -> rel_numparents cell lookup */
HTAB *seen_rels;
HASHCTL ctl;
List *rels_list,
- *rel_numparents;
+ *rel_numparents,
+ *partitioned_rels_list,
+ *other_rels_list;
ListCell *l;
memset(&ctl, 0, sizeof(ctl));
@@ -185,31 +239,71 @@ find_all_inheritors(Oid parentrelId, LOCKMODE lockmode, List **numparents)
/*
* We build a list starting with the given rel and adding all direct and
- * indirect children. We can use a single list as both the record of
- * already-found rels and the agenda of rels yet to be scanned for more
- * children. This is a bit tricky but works because the foreach() macro
- * doesn't fetch the next list element until the bottom of the loop.
+ * indirect children. We can use a single list (rels_list) as both the
+ * record of already-found rels and the agenda of rels yet to be scanned
+ * for more children. This is a bit tricky but works because the foreach()
+ * macro doesn't fetch the next list element until the bottom of the loop.
+ *
+ * partitioned_rels_list will contain the OIDs of the partitioned child
+ * tables and other_rels_list will contain the OIDs of the non-partitioned
+ * child tables. Result list will be generated by concatenating the two
+ * lists together with partitioned_rels_list appearing first.
*/
rels_list = list_make1_oid(parentrelId);
+ partitioned_rels_list = list_make1_oid(parentrelId);
+ other_rels_list = NIL;
rel_numparents = list_make1_int(0);
+ if (num_partitioned_children)
+ *num_partitioned_children = 0;
+
foreach(l, rels_list)
{
Oid currentrel = lfirst_oid(l);
List *currentchildren;
- ListCell *lc;
+ ListCell *lc,
+ *first_nonpartitioned_child;
+ int cur_num_partitioned_children = 0,
+ i;
/* Get the direct children of this rel */
- currentchildren = find_inheritance_children(currentrel, lockmode);
+ currentchildren = find_inheritance_children(currentrel, lockmode,
+ &cur_num_partitioned_children);
+
+ if (num_partitioned_children)
+ *num_partitioned_children += cur_num_partitioned_children;
+
+ /*
+ * Append partitioned children to rels_list and partitioned_rels_list.
+ * We know for sure that partitioned children don't need the
+ * de-duplication logic in the following loop, because partitioned
+ * tables are not allowed to participate in multiple inheritance.
+ */
+ i = 0;
+ foreach(lc, currentchildren)
+ {
+ if (i < cur_num_partitioned_children)
+ {
+ Oid child_oid = lfirst_oid(lc);
+
+ rels_list = lappend_oid(rels_list, child_oid);
+ partitioned_rels_list = lappend_oid(partitioned_rels_list,
+ child_oid);
+ }
+ else
+ break;
+ i++;
+ }
+ first_nonpartitioned_child = lc;
/*
* Add to the queue only those children not already seen. This avoids
* making duplicate entries in case of multiple inheritance paths from
* the same parent. (It'll also keep us from getting into an infinite
* loop, though theoretically there can't be any cycles in the
- * inheritance graph anyway.)
+ * inheritance graph anyway.) Also, add them to the other_rels_list.
*/
- foreach(lc, currentchildren)
+ for_each_cell(lc, first_nonpartitioned_child)
{
Oid child_oid = lfirst_oid(lc);
bool found;
@@ -225,6 +319,7 @@ find_all_inheritors(Oid parentrelId, LOCKMODE lockmode, List **numparents)
{
/* if it's not there, add it. expect 1 parent, initially. */
rels_list = lappend_oid(rels_list, child_oid);
+ other_rels_list = lappend_oid(other_rels_list, child_oid);
rel_numparents = lappend_int(rel_numparents, 1);
hash_entry->numparents_cell = rel_numparents->tail;
}
@@ -237,8 +332,10 @@ find_all_inheritors(Oid parentrelId, LOCKMODE lockmode, List **numparents)
list_free(rel_numparents);
hash_destroy(seen_rels);
+ list_free(rels_list);
- return rels_list;
+ /* List partitioned child tables before non-partitioned ones. */
+ return list_concat(partitioned_rels_list, other_rels_list);
}
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2b638271b3..ae8ce71e1c 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1282,7 +1282,8 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
* the children.
*/
tableOIDs =
- find_all_inheritors(RelationGetRelid(onerel), AccessShareLock, NULL);
+ find_all_inheritors(RelationGetRelid(onerel), AccessShareLock, NULL,
+ NULL);
/*
* Check that there's at least one descendant, else fail. This could
diff --git a/src/backend/commands/lockcmds.c b/src/backend/commands/lockcmds.c
index 9fe9e022b0..529f244f7e 100644
--- a/src/backend/commands/lockcmds.c
+++ b/src/backend/commands/lockcmds.c
@@ -112,7 +112,7 @@ LockTableRecurse(Oid reloid, LOCKMODE lockmode, bool nowait)
List *children;
ListCell *lc;
- children = find_inheritance_children(reloid, NoLock);
+ children = find_inheritance_children(reloid, NoLock, NULL);
foreach(lc, children)
{
diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index 610cb499d2..64179ea3ef 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -516,7 +516,7 @@ OpenTableList(List *tables)
List *children;
children = find_all_inheritors(myrelid, ShareUpdateExclusiveLock,
- NULL);
+ NULL, NULL);
foreach(child, children)
{
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 513a9ec485..a35d7810f2 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -299,10 +299,10 @@ static bool MergeCheckConstraint(List *constraints, char *name, Node *expr);
static void MergeAttributesIntoExisting(Relation child_rel, Relation parent_rel);
static void MergeConstraintsIntoExisting(Relation child_rel, Relation parent_rel);
static void StoreCatalogInheritance(Oid relationId, List *supers,
- bool child_is_partition);
+ bool child_is_partition, bool child_is_partitioned);
static void StoreCatalogInheritance1(Oid relationId, Oid parentOid,
int16 seqNumber, Relation inhRelation,
- bool child_is_partition);
+ bool child_is_partition, bool child_is_partitioned);
static int findAttrByName(const char *attributeName, List *schema);
static void AlterIndexNamespaces(Relation classRel, Relation rel,
Oid oldNspOid, Oid newNspOid, ObjectAddresses *objsMoved);
@@ -746,7 +746,8 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
typaddress);
/* Store inheritance information for new rel. */
- StoreCatalogInheritance(relationId, inheritOids, stmt->partbound != NULL);
+ StoreCatalogInheritance(relationId, inheritOids, stmt->partbound != NULL,
+ relkind == RELKIND_PARTITIONED_TABLE);
/*
* We must bump the command counter to make the newly-created relation
@@ -1231,7 +1232,8 @@ ExecuteTruncate(TruncateStmt *stmt)
ListCell *child;
List *children;
- children = find_all_inheritors(myrelid, AccessExclusiveLock, NULL);
+ children = find_all_inheritors(myrelid, AccessExclusiveLock, NULL,
+ NULL);
foreach(child, children)
{
@@ -2297,7 +2299,7 @@ MergeCheckConstraint(List *constraints, char *name, Node *expr)
*/
static void
StoreCatalogInheritance(Oid relationId, List *supers,
- bool child_is_partition)
+ bool child_is_partition, bool child_is_partitioned)
{
Relation relation;
int16 seqNumber;
@@ -2328,7 +2330,7 @@ StoreCatalogInheritance(Oid relationId, List *supers,
Oid parentOid = lfirst_oid(entry);
StoreCatalogInheritance1(relationId, parentOid, seqNumber, relation,
- child_is_partition);
+ child_is_partition, child_is_partitioned);
seqNumber++;
}
@@ -2342,7 +2344,7 @@ StoreCatalogInheritance(Oid relationId, List *supers,
static void
StoreCatalogInheritance1(Oid relationId, Oid parentOid,
int16 seqNumber, Relation inhRelation,
- bool child_is_partition)
+ bool child_is_partition, bool child_is_partitioned)
{
TupleDesc desc = RelationGetDescr(inhRelation);
Datum values[Natts_pg_inherits];
@@ -2357,6 +2359,8 @@ StoreCatalogInheritance1(Oid relationId, Oid parentOid,
values[Anum_pg_inherits_inhrelid - 1] = ObjectIdGetDatum(relationId);
values[Anum_pg_inherits_inhparent - 1] = ObjectIdGetDatum(parentOid);
values[Anum_pg_inherits_inhseqno - 1] = Int16GetDatum(seqNumber);
+ values[Anum_pg_inherits_inhchildparted - 1] =
+ BoolGetDatum(child_is_partitioned);
memset(nulls, 0, sizeof(nulls));
@@ -2556,7 +2560,7 @@ renameatt_internal(Oid myrelid,
* outside the inheritance hierarchy being processed.
*/
child_oids = find_all_inheritors(myrelid, AccessExclusiveLock,
- &child_numparents);
+ &child_numparents, NULL);
/*
* find_all_inheritors does the recursive search of the inheritance
@@ -2583,7 +2587,7 @@ renameatt_internal(Oid myrelid,
* expected_parents will only be 0 if we are not already recursing.
*/
if (expected_parents == 0 &&
- find_inheritance_children(myrelid, NoLock) != NIL)
+ find_inheritance_children(myrelid, NoLock, NULL) != NIL)
ereport(ERROR,
(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
errmsg("inherited column \"%s\" must be renamed in child tables too",
@@ -2766,7 +2770,7 @@ rename_constraint_internal(Oid myrelid,
*li;
child_oids = find_all_inheritors(myrelid, AccessExclusiveLock,
- &child_numparents);
+ &child_numparents, NULL);
forboth(lo, child_oids, li, child_numparents)
{
@@ -2782,7 +2786,7 @@ rename_constraint_internal(Oid myrelid,
else
{
if (expected_parents == 0 &&
- find_inheritance_children(myrelid, NoLock) != NIL)
+ find_inheritance_children(myrelid, NoLock, NULL) != NIL)
ereport(ERROR,
(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
errmsg("inherited constraint \"%s\" must be renamed in child tables too",
@@ -4790,7 +4794,7 @@ ATSimpleRecursion(List **wqueue, Relation rel,
ListCell *child;
List *children;
- children = find_all_inheritors(relid, lockmode, NULL);
+ children = find_all_inheritors(relid, lockmode, NULL, NULL);
/*
* find_all_inheritors does the recursive search of the inheritance
@@ -5199,7 +5203,7 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel,
*/
if (colDef->identity &&
recurse &&
- find_inheritance_children(myrelid, NoLock) != NIL)
+ find_inheritance_children(myrelid, NoLock, NULL) != NIL)
ereport(ERROR,
(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
errmsg("cannot recursively add identity column to table that has child tables")));
@@ -5405,7 +5409,8 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel,
* routines, we have to do this one level of recursion at a time; we can't
* use find_all_inheritors to do it in one pass.
*/
- children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+ children = find_inheritance_children(RelationGetRelid(rel), lockmode,
+ NULL);
/*
* If we are told not to recurse, there had better not be any child
@@ -6524,7 +6529,8 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
* routines, we have to do this one level of recursion at a time; we can't
* use find_all_inheritors to do it in one pass.
*/
- children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+ children = find_inheritance_children(RelationGetRelid(rel), lockmode,
+ NULL);
if (children)
{
@@ -6958,7 +6964,8 @@ ATAddCheckConstraint(List **wqueue, AlteredTableInfo *tab, Relation rel,
* routines, we have to do this one level of recursion at a time; we can't
* use find_all_inheritors to do it in one pass.
*/
- children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+ children = find_inheritance_children(RelationGetRelid(rel), lockmode,
+ NULL);
/*
* Check if ONLY was specified with ALTER TABLE. If so, allow the
@@ -7677,7 +7684,7 @@ ATExecValidateConstraint(Relation rel, char *constrName, bool recurse,
*/
if (!recursing && !con->connoinherit)
children = find_all_inheritors(RelationGetRelid(rel),
- lockmode, NULL);
+ lockmode, NULL, NULL);
/*
* For CHECK constraints, we must ensure that we only mark the
@@ -8560,7 +8567,8 @@ ATExecDropConstraint(Relation rel, const char *constrName,
* use find_all_inheritors to do it in one pass.
*/
if (!is_no_inherit_constraint)
- children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+ children = find_inheritance_children(RelationGetRelid(rel), lockmode,
+ NULL);
else
children = NIL;
@@ -8849,7 +8857,7 @@ ATPrepAlterColumnType(List **wqueue,
ListCell *child;
List *children;
- children = find_all_inheritors(relid, lockmode, NULL);
+ children = find_all_inheritors(relid, lockmode, NULL, NULL);
/*
* find_all_inheritors does the recursive search of the inheritance
@@ -8900,7 +8908,8 @@ ATPrepAlterColumnType(List **wqueue,
}
}
else if (!recursing &&
- find_inheritance_children(RelationGetRelid(rel), NoLock) != NIL)
+ find_inheritance_children(RelationGetRelid(rel),
+ NoLock, NULL) != NIL)
ereport(ERROR,
(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
errmsg("type of inherited column \"%s\" must be changed in child tables too",
@@ -11010,7 +11019,7 @@ ATExecAddInherit(Relation child_rel, RangeVar *parent, LOCKMODE lockmode)
* We use weakest lock we can on child's children, namely AccessShareLock.
*/
children = find_all_inheritors(RelationGetRelid(child_rel),
- AccessShareLock, NULL);
+ AccessShareLock, NULL, NULL);
if (list_member_oid(children, RelationGetRelid(parent_rel)))
ereport(ERROR,
@@ -11119,6 +11128,8 @@ CreateInheritance(Relation child_rel, Relation parent_rel)
inhseqno + 1,
catalogRelation,
parent_rel->rd_rel->relkind ==
+ RELKIND_PARTITIONED_TABLE,
+ child_rel->rd_rel->relkind ==
RELKIND_PARTITIONED_TABLE);
/* Now we're done with pg_inherits */
@@ -13516,7 +13527,8 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
* weaker lock now and the stronger one only when needed.
*/
attachrel_children = find_all_inheritors(RelationGetRelid(attachrel),
- AccessExclusiveLock, NULL);
+ AccessExclusiveLock, NULL,
+ NULL);
if (list_member_oid(attachrel_children, RelationGetRelid(rel)))
ereport(ERROR,
(errcode(ERRCODE_DUPLICATE_TABLE),
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index faa181207a..e2e5ffce42 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -430,7 +430,8 @@ get_rel_oids(Oid relid, const RangeVar *vacrel)
oldcontext = MemoryContextSwitchTo(vac_context);
if (include_parts)
oid_list = list_concat(oid_list,
- find_all_inheritors(relid, NoLock, NULL));
+ find_all_inheritors(relid, NoLock, NULL,
+ NULL));
else
oid_list = lappend_oid(oid_list, relid);
MemoryContextSwitchTo(oldcontext);
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index eeadd8bec5..3db8b6f971 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -3256,7 +3256,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
* partitions.
*/
all_parts = find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock,
- NULL);
+ NULL, NULL);
list_free(all_parts);
*pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
*num_partitions = list_length(leaf_parts);
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 6d8f8938b2..a59081103a 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1424,7 +1424,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
lockmode = AccessShareLock;
/* Scan for all members of inheritance set, acquire needed locks */
- inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);
+ inhOIDs = find_all_inheritors(parentOID, lockmode, NULL, NULL);
/*
* Check that there's at least one descendant, else treat as no-child
diff --git a/src/include/catalog/pg_inherits.h b/src/include/catalog/pg_inherits.h
index 26bfab5db6..2c4ef246a4 100644
--- a/src/include/catalog/pg_inherits.h
+++ b/src/include/catalog/pg_inherits.h
@@ -33,6 +33,7 @@ CATALOG(pg_inherits,2611) BKI_WITHOUT_OIDS
Oid inhrelid;
Oid inhparent;
int32 inhseqno;
+ bool inhchildparted;
} FormData_pg_inherits;
/* ----------------
@@ -46,10 +47,11 @@ typedef FormData_pg_inherits *Form_pg_inherits;
* compiler constants for pg_inherits
* ----------------
*/
-#define Natts_pg_inherits 3
+#define Natts_pg_inherits 4
#define Anum_pg_inherits_inhrelid 1
#define Anum_pg_inherits_inhparent 2
#define Anum_pg_inherits_inhseqno 3
+#define Anum_pg_inherits_inhchildparted 4
/* ----------------
* pg_inherits has no initial contents
diff --git a/src/include/catalog/pg_inherits_fn.h b/src/include/catalog/pg_inherits_fn.h
index 7743388899..8f371acae7 100644
--- a/src/include/catalog/pg_inherits_fn.h
+++ b/src/include/catalog/pg_inherits_fn.h
@@ -17,9 +17,10 @@
#include "nodes/pg_list.h"
#include "storage/lock.h"
-extern List *find_inheritance_children(Oid parentrelId, LOCKMODE lockmode);
+extern List *find_inheritance_children(Oid parentrelId, LOCKMODE lockmode,
+ int *num_partitioned_children);
extern List *find_all_inheritors(Oid parentrelId, LOCKMODE lockmode,
- List **parents);
+ List **parents, int *num_partitioned_children);
extern bool has_subclass(Oid relationId);
extern bool has_superclass(Oid relationId);
extern bool typeInheritsFrom(Oid subclassTypeId, Oid superclassTypeId);
--
2.11.0
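For anyone wanting to check the ordering that inhchildinfo_cmp() is meant to produce without building the tree, the following standalone sketch reproduces it outside the backend. The Oid typedef and the sort_children() helper are stand-ins I made up for the illustration; only the struct layout and the comparison rule (partitioned children first, then ascending OID) come from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stand-in for PostgreSQL's Oid type; an assumption for this sketch. */
typedef unsigned int Oid;

/* Same shape as the InhChildInfo struct added by the patch. */
typedef struct InhChildInfo
{
	Oid			relid;
	bool		is_partitioned;
} InhChildInfo;

/* qsort comparator: partitioned children first, then ascending OID. */
static int
inhchildinfo_cmp(const void *p1, const void *p2)
{
	const InhChildInfo *c1 = (const InhChildInfo *) p1;
	const InhChildInfo *c2 = (const InhChildInfo *) p2;

	if (c1->is_partitioned != c2->is_partitioned)
		return c1->is_partitioned ? -1 : 1;
	if (c1->relid != c2->relid)
		return c1->relid < c2->relid ? -1 : 1;
	return 0;
}

/*
 * Hypothetical helper: sort the children in place and return how many
 * partitioned ones end up at the head of the array.
 */
static int
sort_children(InhChildInfo *children, int n)
{
	int			i;
	int			num_partitioned = 0;

	assert(n >= 0);
	if (n > 1)
		qsort(children, (size_t) n, sizeof(InhChildInfo), inhchildinfo_cmp);
	for (i = 0; i < n && children[i].is_partitioned; i++)
		num_partitioned++;
	return num_partitioned;
}
```

Because every backend sorts with the same rule, the resulting order is still deterministic, which is what the deadlock-avoidance argument for locking children relies on.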
0003-Teach-expand_inherited_rtentry-to-use-partition-boun.patch
From 928eabebed8806f2ead413744ac196bb9caef646 Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Wed, 9 Aug 2017 15:52:36 +0900
Subject: [PATCH 3/4] Teach expand_inherited_rtentry to use partition bound
order
After locking the child tables using find_all_inheritors, we discard
the list of child table OIDs that it generates and rebuild the same
using the information returned by RelationGetPartitionDispatchInfo.
---
src/backend/optimizer/prep/prepunion.c | 51 ++++++++++++++++++++++++++++++++++
1 file changed, 51 insertions(+)
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index a59081103a..734a7e55df 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -33,6 +33,7 @@
#include "access/heapam.h"
#include "access/htup_details.h"
#include "access/sysattr.h"
+#include "catalog/partition.h"
#include "catalog/pg_inherits_fn.h"
#include "catalog/pg_type.h"
#include "miscadmin.h"
@@ -1452,6 +1453,56 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
*/
oldrelation = heap_open(parentOID, NoLock);
+ /*
+ * For partitioned tables, we arrange the child table OIDs such that they
+ * appear in the partition bound order.
+ */
+ if (rte->relkind == RELKIND_PARTITIONED_TABLE)
+ {
+ List *leaf_part_oids;
+ int num_parted,
+ i;
+ PartitionDispatch *pds;
+
+ /* Discard the original list. */
+ list_free(inhOIDs);
+ inhOIDs = NIL;
+
+ /* Request partitioning information. */
+ pds = RelationGetPartitionDispatchInfo(oldrelation, &num_parted,
+ &leaf_part_oids);
+
+ /*
+ * First collect the partitioned child table OIDs, which includes the
+ * root parent at the head.
+ */
+ for (i = 0; i < num_parted; i++)
+ {
+ PartitionDispatch pd = pds[i];
+
+ inhOIDs = lappend_oid(inhOIDs, RelationGetRelid(pd->reldesc));
+ }
+
+ /* Concatenate the leaf partition OIDs. */
+ inhOIDs = list_concat(inhOIDs, leaf_part_oids);
+
+ /*
+ * Release the resources that RelationGetPartitionDispatchInfo
+ * acquired for us but we don't really need in this case. Note that
+ * we don't touch the root partitioned table itself by starting the
+ * loop with 1, not 0.
+ */
+ for (i = 1; i < num_parted; i++)
+ {
+ PartitionDispatch pd = pds[i];
+
+ heap_close(pd->reldesc, NoLock);
+ ExecDropSingleTupleTableSlot(pd->tupslot);
+ if (pd->tupmap)
+ pfree(pd->tupmap);
+ }
+ }
+
/* Scan the inheritance set and expand it */
appinfos = NIL;
has_child = false;
--
2.11.0
0004-Decouple-RelationGetPartitionDispatchInfo-from-execu.patch
From b2e3f1508534ddc49f192437b44810a6f0a0f1b4 Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Mon, 24 Jul 2017 18:59:57 +0900
Subject: [PATCH 4/4] Decouple RelationGetPartitionDispatchInfo() from executor
Currently, it and the structures it generates, viz. PartitionDispatch
objects, are too tightly coupled with the executor's tuple-routing code. In
particular, it's pretty undesirable that it makes it the responsibility
of the caller to release some resources, such as relcache references
and tuple table slots. That makes it harder to use in places other
than where it's currently being used.
After this refactoring, ExecSetupPartitionTupleRouting() now needs to
do some of the work that was previously done in
RelationGetPartitionDispatchInfo() and expand_inherited_rtentry() no
longer needs to do some things that it used to.
---
src/backend/catalog/partition.c | 309 +++++++++++++++++----------------
src/backend/commands/copy.c | 35 ++--
src/backend/executor/execMain.c | 146 ++++++++++++++--
src/backend/executor/nodeModifyTable.c | 29 ++--
src/backend/optimizer/prep/prepunion.c | 32 +---
src/include/catalog/partition.h | 52 +++---
src/include/executor/executor.h | 4 +-
src/include/nodes/execnodes.h | 53 +++++-
8 files changed, 399 insertions(+), 261 deletions(-)
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index efc025ec42..36f5c80b4f 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -105,6 +105,24 @@ typedef struct PartitionRangeBound
bool lower; /* this is the lower (vs upper) bound */
} PartitionRangeBound;
+/*-----------------------
+ * PartitionDispatchData - information of partitions of one partitioned table
+ * in a partition tree
+ *
+ * partkey Partition key of the table
+ * partdesc Partition descriptor of the table
+ * indexes Array with partdesc->nparts members (for details on what the
+ * individual value represents, see the comments in
+ * RelationGetPartitionDispatchInfo())
+ *-----------------------
+ */
+typedef struct PartitionDispatchData
+{
+ PartitionKey partkey; /* Points into the table's relcache entry */
+ PartitionDesc partdesc; /* Ditto */
+ int *indexes;
+} PartitionDispatchData;
+
static int32 qsort_partition_list_value_cmp(const void *a, const void *b,
void *arg);
static int32 qsort_partition_rbound_cmp(const void *a, const void *b,
@@ -981,181 +999,165 @@ get_partition_qual_relid(Oid relid)
}
/*
- * Append OIDs of rel's partitions to the list 'partoids' and for each OID,
- * append pointer rel to the list 'parents'.
- */
-#define APPEND_REL_PARTITION_OIDS(rel, partoids, parents) \
- do\
- {\
- int i;\
- for (i = 0; i < (rel)->rd_partdesc->nparts; i++)\
- {\
- (partoids) = lappend_oid((partoids), (rel)->rd_partdesc->oids[i]);\
- (parents) = lappend((parents), (rel));\
- }\
- } while(0)
-
-/*
* RelationGetPartitionDispatchInfo
- * Returns information necessary to route tuples down a partition tree
+ * Returns necessary information for each partition in the partition
+ * tree rooted at rel
*
- * The number of elements in the returned array (that is, the number of
- * PartitionDispatch objects for the partitioned tables in the partition tree)
- * is returned in *num_parted and a list of the OIDs of all the leaf
- * partitions of rel is returned in *leaf_part_oids.
+ * Information returned includes the following: *ptinfos contains a list of
+ * PartitionedTableInfo objects, one for each partitioned table (with at least
+ * one member, that is, one for the root partitioned table), *leaf_part_oids
+ * contains a list of the OIDs of all the leaf partitions.
*
- * All the relations in the partition tree (including 'rel') must have been
- * locked (using at least the AccessShareLock) by the caller.
+ * We require that the caller has locked at least the partitioned tables in the
+ * partition tree (including 'rel') using at least the AccessShareLock,
+ * because we need to look at their relcache entries to get PartitionKey and
+ * PartitionDesc.
*/
-PartitionDispatch *
+void
RelationGetPartitionDispatchInfo(Relation rel,
- int *num_parted, List **leaf_part_oids)
+ List **ptinfos, List **leaf_part_oids)
{
- PartitionDispatchData **pd;
- List *all_parts = NIL,
- *all_parents = NIL,
- *parted_rels,
- *parted_rel_parents;
+ List *all_parts,
+ *all_parents;
ListCell *lc1,
*lc2;
int i,
- k,
offset;
/*
* We rely on the relcache to traverse the partition tree to build both
- * the leaf partition OIDs list and the array of PartitionDispatch objects
- * for the partitioned tables in the tree. That means every partitioned
- * table in the tree must be locked, which is fine since we require the
- * caller to lock all the partitions anyway.
+ * the leaf partition OIDs list and the list of PartitionedTableInfo
+ * objects for partitioned tables. That means every partitioned table in
+ * the tree must be locked, which is fine since the callers must have done
+ * that already.
*
* For every partitioned table in the tree, starting with the root
* partitioned table, add its relcache entry to parted_rels, while also
* queuing its partitions (in the order in which they appear in the
* partition descriptor) to be looked at later in the same loop. This is
* a bit tricky but works because the foreach() macro doesn't fetch the
- * next list element until the bottom of the loop.
+ * next list element until the bottom of the loop. Non-partitioned tables
+ * are simply added to the leaf partitions list.
*/
- *num_parted = 1;
- parted_rels = list_make1(rel);
- /* Root partitioned table has no parent, so NULL for parent */
- parted_rel_parents = list_make1(NULL);
- APPEND_REL_PARTITION_OIDS(rel, all_parts, all_parents);
+ i = offset = 0;
+ *ptinfos = *leaf_part_oids = NIL;
+
+ /* Start with the root table. */
+ all_parts = list_make1_oid(RelationGetRelid(rel));
+ all_parents = list_make1_oid(InvalidOid);
forboth(lc1, all_parts, lc2, all_parents)
{
- Oid partrelid = lfirst_oid(lc1);
- Relation parent = lfirst(lc2);
+ Oid partrelid = lfirst_oid(lc1);
+ Oid parentrelid = lfirst_oid(lc2);
if (get_rel_relkind(partrelid) == RELKIND_PARTITIONED_TABLE)
{
- /*
- * Already locked by the caller. Note that it is the
- * responsibility of the caller to close the below relcache entry,
- * once done using the information being collected here (for
- * example, in ExecEndModifyTable).
- */
- Relation partrel = heap_open(partrelid, NoLock);
+ int j,
+ k;
+ Relation partrel;
+ PartitionKey partkey;
+ PartitionDesc partdesc;
+ PartitionedTableInfo *ptinfo;
+ PartitionDispatch pd;
+
+ if (partrelid != RelationGetRelid(rel))
+ partrel = heap_open(partrelid, NoLock);
+ else
+ partrel = rel;
- (*num_parted)++;
- parted_rels = lappend(parted_rels, partrel);
- parted_rel_parents = lappend(parted_rel_parents, parent);
- APPEND_REL_PARTITION_OIDS(partrel, all_parts, all_parents);
- }
- }
+ partkey = RelationGetPartitionKey(partrel);
+ partdesc = RelationGetPartitionDesc(partrel);
+
+ ptinfo = (PartitionedTableInfo *)
+ palloc0(sizeof(PartitionedTableInfo));
+ ptinfo->relid = partrelid;
+ ptinfo->parentid = parentrelid;
+
+ ptinfo->pd = pd = (PartitionDispatchData *)
+ palloc0(sizeof(PartitionDispatchData));
+ pd->partkey = partkey;
- /*
- * We want to create two arrays - one for leaf partitions and another for
- * partitioned tables (including the root table and internal partitions).
- * While we only create the latter here, leaf partition array of suitable
- * objects (such as, ResultRelInfo) is created by the caller using the
- * list of OIDs we return. Indexes into these arrays get assigned in a
- * breadth-first manner, whereby partitions of any given level are placed
- * consecutively in the respective arrays.
- */
- pd = (PartitionDispatchData **) palloc(*num_parted *
- sizeof(PartitionDispatchData *));
- *leaf_part_oids = NIL;
- i = k = offset = 0;
- forboth(lc1, parted_rels, lc2, parted_rel_parents)
- {
- Relation partrel = lfirst(lc1);
- Relation parent = lfirst(lc2);
- PartitionKey partkey = RelationGetPartitionKey(partrel);
- TupleDesc tupdesc = RelationGetDescr(partrel);
- PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
- int j,
- m;
-
- pd[i] = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
- pd[i]->reldesc = partrel;
- pd[i]->key = partkey;
- pd[i]->keystate = NIL;
- pd[i]->partdesc = partdesc;
- if (parent != NULL)
- {
/*
- * For every partitioned table other than root, we must store a
- * tuple table slot initialized with its tuple descriptor and a
- * tuple conversion map to convert a tuple from its parent's
- * rowtype to its own. That is to make sure that we are looking at
- * the correct row using the correct tuple descriptor when
- * computing its partition key for tuple routing.
+ * XXX- do we need a pinning mechanism for partition descriptors
+ * so that their references can be managed independently of
+ * the parent relcache entry? Like PinPartitionDesc(partdesc)?
*/
- pd[i]->tupslot = MakeSingleTupleTableSlot(tupdesc);
- pd[i]->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
- tupdesc,
- gettext_noop("could not convert row type"));
- }
- else
- {
- /* Not required for the root partitioned table */
- pd[i]->tupslot = NULL;
- pd[i]->tupmap = NULL;
- }
- pd[i]->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
+ pd->partdesc = partdesc;
- /*
- * Indexes corresponding to the internal partitions are multiplied by
- * -1 to distinguish them from those of leaf partitions. Encountering
- * an index >= 0 means we found a leaf partition, which is immediately
- * returned as the partition we are looking for. A negative index
- * means we found a partitioned table, whose PartitionDispatch object
- * is located at the above index multiplied back by -1. Using the
- * PartitionDispatch object, search is continued further down the
- * partition tree.
- */
- m = 0;
- for (j = 0; j < partdesc->nparts; j++)
- {
- Oid partrelid = partdesc->oids[j];
+ /*
+ * The values contained in the following array correspond to
+ * indexes of this table's partitions in the global sequence of
+ * all the partitions contained in the partition tree rooted at
+ * rel, traversed in a breadth-first manner. The values should be
+ * such that we will be able to distinguish the leaf partitions
+ * from the non-leaf partitions, because they are returned to
+ * the caller in separate structures from where they will be
+ * accessed. The way that's done is described below:
+ *
+ * Leaf partition OIDs are put into the global leaf_part_oids list,
+ * and for each one, the value stored is its ordinal position in
+ * the list minus 1 (that is, its zero-based list index).
+ *
+ * PartitionedTableInfo objects corresponding to partitions that
+ * are partitioned tables are put into the global ptinfos list,
+ * and for each one, the value stored is its zero-based index in
+ * the list multiplied by -1 (index 0 is the root, so never 0).
+ *
+ * So while looking at the values in the indexes array, if one
+ * gets zero or a positive value, then it's a leaf partition;
+ * otherwise, it's a partitioned table.
+ */
+ pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
- if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
- {
- *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
- pd[i]->indexes[j] = k++;
- }
- else
+ k = 0;
+ for (j = 0; j < partdesc->nparts; j++)
{
+ Oid partrelid = partdesc->oids[j];
+
/*
- * offset denotes the number of partitioned tables of upper
- * levels including those of the current level. Any partition
- * of this table must belong to the next level and hence will
- * be placed after the last partitioned table of this level.
+ * Queue this partition so that it will be processed later
+ * by the outer loop.
*/
- pd[i]->indexes[j] = -(1 + offset + m);
- m++;
+ all_parts = lappend_oid(all_parts, partrelid);
+ all_parents = lappend_oid(all_parents,
+ RelationGetRelid(partrel));
+
+ if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
+ {
+ *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
+ pd->indexes[j] = i++;
+ }
+ else
+ {
+ /*
+ * offset denotes the number of partitioned tables (other
+ * than the root) already assigned an index by earlier
+ * parents. k counts the number of partitions of this
+ * table that were found to be partitioned tables.
+ */
+ pd->indexes[j] = -(1 + offset + k);
+ k++;
+ }
}
- }
- i++;
- /*
- * This counts the number of partitioned tables at upper levels
- * including those of the current level.
- */
- offset += m;
+ offset += k;
+
+ /*
+ * Release the relation descriptor. The lock that we hold on the
+ * table will keep the PartitionDesc that is pointing into
+ * RelationData intact, a pointer to which we hope to keep
+ * through this transaction's commit.
+ * (XXX - how true is that?)
+ */
+ if (partrel != rel)
+ heap_close(partrel, NoLock);
+
+ *ptinfos = lappend(*ptinfos, ptinfo);
+ }
}
- return pd;
+ Assert(i == list_length(*leaf_part_oids));
+ Assert((offset + 1) == list_length(*ptinfos));
}
/* Module-local functions */
@@ -1872,7 +1874,7 @@ generate_partition_qual(Relation rel)
* ----------------
*/
void
-FormPartitionKeyDatum(PartitionDispatch pd,
+FormPartitionKeyDatum(PartitionKeyInfo *keyinfo,
TupleTableSlot *slot,
EState *estate,
Datum *values,
@@ -1881,20 +1883,21 @@ FormPartitionKeyDatum(PartitionDispatch pd,
ListCell *partexpr_item;
int i;
- if (pd->key->partexprs != NIL && pd->keystate == NIL)
+ if (keyinfo->key->partexprs != NIL && keyinfo->keystate == NIL)
{
/* Check caller has set up context correctly */
Assert(estate != NULL &&
GetPerTupleExprContext(estate)->ecxt_scantuple == slot);
/* First time through, set up expression evaluation state */
- pd->keystate = ExecPrepareExprList(pd->key->partexprs, estate);
+ keyinfo->keystate = ExecPrepareExprList(keyinfo->key->partexprs,
+ estate);
}
- partexpr_item = list_head(pd->keystate);
- for (i = 0; i < pd->key->partnatts; i++)
+ partexpr_item = list_head(keyinfo->keystate);
+ for (i = 0; i < keyinfo->key->partnatts; i++)
{
- AttrNumber keycol = pd->key->partattrs[i];
+ AttrNumber keycol = keyinfo->key->partattrs[i];
Datum datum;
bool isNull;
@@ -1931,13 +1934,13 @@ FormPartitionKeyDatum(PartitionDispatch pd,
* the latter case.
*/
int
-get_partition_for_tuple(PartitionDispatch *pd,
+get_partition_for_tuple(PartitionTupleRoutingInfo **ptrinfos,
TupleTableSlot *slot,
EState *estate,
- PartitionDispatchData **failed_at,
+ PartitionTupleRoutingInfo **failed_at,
TupleTableSlot **failed_slot)
{
- PartitionDispatch parent;
+ PartitionTupleRoutingInfo *parent;
Datum values[PARTITION_MAX_KEYS];
bool isnull[PARTITION_MAX_KEYS];
int cur_offset,
@@ -1948,11 +1951,11 @@ get_partition_for_tuple(PartitionDispatch *pd,
TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
/* start with the root partitioned table */
- parent = pd[0];
+ parent = ptrinfos[0];
while (true)
{
- PartitionKey key = parent->key;
- PartitionDesc partdesc = parent->partdesc;
+ PartitionKey key = parent->pd->partkey;
+ PartitionDesc partdesc = parent->pd->partdesc;
TupleTableSlot *myslot = parent->tupslot;
TupleConversionMap *map = parent->tupmap;
@@ -1984,7 +1987,7 @@ get_partition_for_tuple(PartitionDispatch *pd,
* So update ecxt_scantuple accordingly.
*/
ecxt->ecxt_scantuple = slot;
- FormPartitionKeyDatum(parent, slot, estate, values, isnull);
+ FormPartitionKeyDatum(parent->keyinfo, slot, estate, values, isnull);
if (key->strategy == PARTITION_STRATEGY_RANGE)
{
@@ -2055,13 +2058,13 @@ get_partition_for_tuple(PartitionDispatch *pd,
*failed_slot = slot;
break;
}
- else if (parent->indexes[cur_index] >= 0)
+ else if (parent->pd->indexes[cur_index] >= 0)
{
- result = parent->indexes[cur_index];
+ result = parent->pd->indexes[cur_index];
break;
}
else
- parent = pd[-parent->indexes[cur_index]];
+ parent = ptrinfos[-parent->pd->indexes[cur_index]];
}
error_exit:
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index a258965c20..e17a339349 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -165,8 +165,8 @@ typedef struct CopyStateData
bool volatile_defexprs; /* is any of defexprs volatile? */
List *range_table;
- PartitionDispatch *partition_dispatch_info;
- int num_dispatch; /* Number of entries in the above array */
+ PartitionTupleRoutingInfo **ptrinfos;
+ int num_parted; /* Number of entries in the above array */
int num_partitions; /* Number of members in the following arrays */
ResultRelInfo *partitions; /* Per partition result relation */
TupleConversionMap **partition_tupconv_maps;
@@ -1425,7 +1425,7 @@ BeginCopy(ParseState *pstate,
/* Initialize state for CopyFrom tuple routing. */
if (is_from && rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- PartitionDispatch *partition_dispatch_info;
+ PartitionTupleRoutingInfo **ptrinfos;
ResultRelInfo *partitions;
TupleConversionMap **partition_tupconv_maps;
TupleTableSlot *partition_tuple_slot;
@@ -1434,13 +1434,13 @@ BeginCopy(ParseState *pstate,
ExecSetupPartitionTupleRouting(rel,
1,
- &partition_dispatch_info,
+ &ptrinfos,
&partitions,
&partition_tupconv_maps,
&partition_tuple_slot,
&num_parted, &num_partitions);
- cstate->partition_dispatch_info = partition_dispatch_info;
- cstate->num_dispatch = num_parted;
+ cstate->ptrinfos = ptrinfos;
+ cstate->num_parted = num_parted;
cstate->partitions = partitions;
cstate->num_partitions = num_partitions;
cstate->partition_tupconv_maps = partition_tupconv_maps;
@@ -2495,7 +2495,7 @@ CopyFrom(CopyState cstate)
if ((resultRelInfo->ri_TrigDesc != NULL &&
(resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
- cstate->partition_dispatch_info != NULL ||
+ cstate->ptrinfos != NULL ||
cstate->volatile_defexprs)
{
useHeapMultiInsert = false;
@@ -2573,7 +2573,7 @@ CopyFrom(CopyState cstate)
ExecStoreTuple(tuple, slot, InvalidBuffer, false);
/* Determine the partition to heap_insert the tuple into */
- if (cstate->partition_dispatch_info)
+ if (cstate->ptrinfos)
{
int leaf_part_index;
TupleConversionMap *map;
@@ -2587,7 +2587,7 @@ CopyFrom(CopyState cstate)
* partition, respectively.
*/
leaf_part_index = ExecFindPartition(resultRelInfo,
- cstate->partition_dispatch_info,
+ cstate->ptrinfos,
slot,
estate);
Assert(leaf_part_index >= 0 &&
@@ -2819,23 +2819,20 @@ CopyFrom(CopyState cstate)
ExecCloseIndices(resultRelInfo);
- /* Close all the partitioned tables, leaf partitions, and their indices */
- if (cstate->partition_dispatch_info)
+ /* Close all the leaf partitions and their indices */
+ if (cstate->ptrinfos)
{
int i;
/*
- * Remember cstate->partition_dispatch_info[0] corresponds to the root
- * partitioned table, which we must not try to close, because it is
- * the main target table of COPY that will be closed eventually by
- * DoCopy(). Also, tupslot is NULL for the root partitioned table.
+ * cstate->ptrinfos[0] corresponds to the root partitioned table, for
+ * which we didn't create tupslot.
*/
- for (i = 1; i < cstate->num_dispatch; i++)
+ for (i = 1; i < cstate->num_parted; i++)
{
- PartitionDispatch pd = cstate->partition_dispatch_info[i];
+ PartitionTupleRoutingInfo *ptrinfo = cstate->ptrinfos[i];
- heap_close(pd->reldesc, NoLock);
- ExecDropSingleTupleTableSlot(pd->tupslot);
+ ExecDropSingleTupleTableSlot(ptrinfo->tupslot);
}
for (i = 0; i < cstate->num_partitions; i++)
{
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 3db8b6f971..790fd8f208 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -3215,8 +3215,8 @@ EvalPlanQualEnd(EPQState *epqstate)
* tuple routing for partitioned tables
*
* Output arguments:
- * 'pd' receives an array of PartitionDispatch objects with one entry for
- * every partitioned table in the partition tree
+ * 'ptrinfos' receives an array of PartitionTupleRoutingInfo objects with one
+ * entry for each partitioned table in the partition tree
* 'partitions' receives an array of ResultRelInfo objects with one entry for
* every leaf partition in the partition tree
* 'tup_conv_maps' receives an array of TupleConversionMap objects with one
@@ -3238,7 +3238,7 @@ EvalPlanQualEnd(EPQState *epqstate)
void
ExecSetupPartitionTupleRouting(Relation rel,
Index resultRTindex,
- PartitionDispatch **pd,
+ PartitionTupleRoutingInfo ***ptrinfos,
ResultRelInfo **partitions,
TupleConversionMap ***tup_conv_maps,
TupleTableSlot **partition_tuple_slot,
@@ -3246,10 +3246,12 @@ ExecSetupPartitionTupleRouting(Relation rel,
{
TupleDesc tupDesc = RelationGetDescr(rel);
List *leaf_parts;
+ List *ptinfos = NIL;
ListCell *cell;
int i;
ResultRelInfo *leaf_part_rri;
List *all_parts;
+ Relation parent;
/*
* Get the information about the partition tree after locking all the
@@ -3258,7 +3260,125 @@ ExecSetupPartitionTupleRouting(Relation rel,
all_parts = find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock,
NULL, NULL);
list_free(all_parts);
- *pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
+
+ RelationGetPartitionDispatchInfo(rel, &ptinfos, &leaf_parts);
+
+ /*
+ * The ptinfos list contains PartitionedTableInfo objects for all the
+ * partitioned tables in the partition tree. Using the information
+ * therein, we construct an array of PartitionTupleRoutingInfo objects
+ * to be used during tuple-routing.
+ */
+ *num_parted = list_length(ptinfos);
+ *ptrinfos = (PartitionTupleRoutingInfo **) palloc0(*num_parted *
+ sizeof(PartitionTupleRoutingInfo *));
+ /*
+ * Free the ptinfos List structure itself as we go through (open-coded
+ * list_free).
+ */
+ i = 0;
+ cell = list_head(ptinfos);
+ parent = NULL;
+ while (cell)
+ {
+ ListCell *tmp = cell;
+ PartitionedTableInfo *ptinfo = lfirst(tmp),
+ *next_ptinfo = NULL;
+ Relation partrel;
+ PartitionTupleRoutingInfo *ptrinfo;
+
+ if (lnext(tmp))
+ next_ptinfo = lfirst(lnext(tmp));
+
+ /* As mentioned above, the partitioned tables have been locked. */
+ if (ptinfo->relid != RelationGetRelid(rel))
+ partrel = heap_open(ptinfo->relid, NoLock);
+ else
+ partrel = rel;
+
+ ptrinfo = (PartitionTupleRoutingInfo *)
+ palloc0(sizeof(PartitionTupleRoutingInfo));
+ ptrinfo->relid = ptinfo->relid;
+
+ /* Stash a reference to this PartitionDispatch. */
+ ptrinfo->pd = ptinfo->pd;
+
+ /* State for extracting partition key from tuples will go here. */
+ ptrinfo->keyinfo = (PartitionKeyInfo *)
+ palloc0(sizeof(PartitionKeyInfo));
+ ptrinfo->keyinfo->key = RelationGetPartitionKey(partrel);
+ ptrinfo->keyinfo->keystate = NIL;
+
+ /*
+ * For every partitioned table other than root, we must store a tuple
+ * table slot initialized with its tuple descriptor and a tuple
+ * conversion map to convert a tuple from its parent's rowtype to its
+ * own. That is to make sure that we are looking at the correct row
+ * using the correct tuple descriptor when computing its partition key
+ * for tuple routing.
+ */
+ if (ptinfo->parentid != InvalidOid)
+ {
+ TupleDesc tupdesc = RelationGetDescr(partrel);
+
+ /* Open the parent relation descriptor if not already done. */
+ if (ptinfo->parentid == RelationGetRelid(rel))
+ {
+ parent = rel;
+ }
+ else if (parent == NULL)
+ {
+ /* Already locked by find_all_inheritors() above. */
+ parent = heap_open(ptinfo->parentid, NoLock);
+ }
+
+ ptrinfo->tupslot = MakeSingleTupleTableSlot(tupdesc);
+ ptrinfo->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
+ tupdesc,
+ gettext_noop("could not convert row type"));
+
+ /*
+ * Forget the parent descriptor (closing it unless it is rel) if
+ * this is the last partitioned table or the next one in the list
+ * is not a sibling, because the latter has a different parent.
+ */
+ if (next_ptinfo == NULL ||
+ next_ptinfo->parentid != ptinfo->parentid)
+ {
+ if (parent != rel)
+ heap_close(parent, NoLock);
+ parent = NULL;
+ }
+
+ /*
+ * Release the relation descriptor. The lock that we hold on the
+ * table will keep the PartitionDesc that is pointing into
+ * RelationData intact, a pointer to which we hope to keep
+ * through this transaction's commit.
+ * (XXX - how true is that?)
+ */
+ if (partrel != rel)
+ heap_close(partrel, NoLock);
+ }
+ else
+ {
+ /* Not required for the root partitioned table */
+ ptrinfo->tupslot = NULL;
+ ptrinfo->tupmap = NULL;
+ }
+
+ (*ptrinfos)[i++] = ptrinfo;
+
+ /* Free the ListCell. */
+ cell = lnext(cell);
+ pfree(tmp);
+ }
+
+ /* Free the List itself. */
+ if (ptinfos)
+ pfree(ptinfos);
+
+ /* For leaf partitions, we build ResultRelInfos and TupleConversionMaps. */
*num_partitions = list_length(leaf_parts);
*partitions = (ResultRelInfo *) palloc(*num_partitions *
sizeof(ResultRelInfo));
@@ -3284,7 +3404,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
* All the partitions were locked above. Note that the relcache
* entries will be closed by ExecEndModifyTable().
*/
- partrel = heap_open(lfirst_oid(cell), NoLock);
+ partrel = heap_open(lfirst_oid(cell), RowExclusiveLock);
part_tupdesc = RelationGetDescr(partrel);
/*
@@ -3297,7 +3417,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
* partition from the parent's type to the partition's.
*/
(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
- gettext_noop("could not convert row type"));
+ gettext_noop("could not convert row type"));
InitResultRelInfo(leaf_part_rri,
partrel,
@@ -3331,11 +3451,13 @@ ExecSetupPartitionTupleRouting(Relation rel,
* by get_partition_for_tuple() unchanged.
*/
int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
- TupleTableSlot *slot, EState *estate)
+ExecFindPartition(ResultRelInfo *resultRelInfo,
+ PartitionTupleRoutingInfo **ptrinfos,
+ TupleTableSlot *slot,
+ EState *estate)
{
int result;
- PartitionDispatchData *failed_at;
+ PartitionTupleRoutingInfo *failed_at;
TupleTableSlot *failed_slot;
/*
@@ -3345,7 +3467,7 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
if (resultRelInfo->ri_PartitionCheck)
ExecPartitionCheck(resultRelInfo, slot, estate);
- result = get_partition_for_tuple(pd, slot, estate,
+ result = get_partition_for_tuple(ptrinfos, slot, estate,
&failed_at, &failed_slot);
if (result < 0)
{
@@ -3355,9 +3477,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
char *val_desc;
ExprContext *ecxt = GetPerTupleExprContext(estate);
- failed_rel = failed_at->reldesc;
+ failed_rel = heap_open(failed_at->relid, NoLock);
ecxt->ecxt_scantuple = failed_slot;
- FormPartitionKeyDatum(failed_at, failed_slot, estate,
+ FormPartitionKeyDatum(failed_at->keyinfo, failed_slot, estate,
key_values, key_isnull);
val_desc = ExecBuildSlotPartitionKeyDescription(failed_rel,
key_values,
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 36b2b43bc6..9cf974c938 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -277,7 +277,7 @@ ExecInsert(ModifyTableState *mtstate,
resultRelInfo = estate->es_result_relation_info;
/* Determine the partition to heap_insert the tuple into */
- if (mtstate->mt_partition_dispatch_info)
+ if (mtstate->mt_ptrinfos)
{
int leaf_part_index;
TupleConversionMap *map;
@@ -291,7 +291,7 @@ ExecInsert(ModifyTableState *mtstate,
* respectively.
*/
leaf_part_index = ExecFindPartition(resultRelInfo,
- mtstate->mt_partition_dispatch_info,
+ mtstate->mt_ptrinfos,
slot,
estate);
Assert(leaf_part_index >= 0 &&
@@ -1486,7 +1486,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
int numResultRelInfos;
/* Find the set of partitions so that we can find their TupleDescs. */
- if (mtstate->mt_partition_dispatch_info != NULL)
+ if (mtstate->mt_ptrinfos != NULL)
{
/*
* For INSERT via partitioned table, so we need TupleDescs based
@@ -1910,7 +1910,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
if (operation == CMD_INSERT &&
rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- PartitionDispatch *partition_dispatch_info;
+ PartitionTupleRoutingInfo **ptrinfos;
ResultRelInfo *partitions;
TupleConversionMap **partition_tupconv_maps;
TupleTableSlot *partition_tuple_slot;
@@ -1919,13 +1919,13 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
ExecSetupPartitionTupleRouting(rel,
node->nominalRelation,
- &partition_dispatch_info,
+ &ptrinfos,
&partitions,
&partition_tupconv_maps,
&partition_tuple_slot,
&num_parted, &num_partitions);
- mtstate->mt_partition_dispatch_info = partition_dispatch_info;
- mtstate->mt_num_dispatch = num_parted;
+ mtstate->mt_ptrinfos = ptrinfos;
+ mtstate->mt_num_parted = num_parted;
mtstate->mt_partitions = partitions;
mtstate->mt_num_partitions = num_partitions;
mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
@@ -2335,19 +2335,16 @@ ExecEndModifyTable(ModifyTableState *node)
}
/*
- * Close all the partitioned tables, leaf partitions, and their indices
+ * Close all the leaf partitions and their indices.
*
- * Remember node->mt_partition_dispatch_info[0] corresponds to the root
- * partitioned table, which we must not try to close, because it is the
- * main target table of the query that will be closed by ExecEndPlan().
- * Also, tupslot is NULL for the root partitioned table.
+ * node->mt_ptrinfos[0] corresponds to the root partitioned table, for
+ * which we didn't create tupslot.
*/
- for (i = 1; i < node->mt_num_dispatch; i++)
+ for (i = 1; i < node->mt_num_parted; i++)
{
- PartitionDispatch pd = node->mt_partition_dispatch_info[i];
+ PartitionTupleRoutingInfo *ptrinfo = node->mt_ptrinfos[i];
- heap_close(pd->reldesc, NoLock);
- ExecDropSingleTupleTableSlot(pd->tupslot);
+ ExecDropSingleTupleTableSlot(ptrinfo->tupslot);
}
for (i = 0; i < node->mt_num_partitions; i++)
{
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 734a7e55df..2d6f3900c3 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1459,48 +1459,30 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
*/
if (rte->relkind == RELKIND_PARTITIONED_TABLE)
{
- List *leaf_part_oids;
- int num_parted,
- i;
- PartitionDispatch *pds;
+ List *leaf_part_oids,
+ *ptinfos;
/* Discard the original list. */
list_free(inhOIDs);
inhOIDs = NIL;
/* Request partitioning information. */
- pds = RelationGetPartitionDispatchInfo(oldrelation, &num_parted,
- &leaf_part_oids);
+ RelationGetPartitionDispatchInfo(oldrelation, &ptinfos,
+ &leaf_part_oids);
/*
* First collect the partitioned child table OIDs, which includes the
* root parent at the head.
*/
- for (i = 0; i < num_parted; i++)
+ foreach(l, ptinfos)
{
- PartitionDispatch pd = pds[i];
+ PartitionedTableInfo *ptinfo = lfirst(l);
- inhOIDs = lappend_oid(inhOIDs, RelationGetRelid(pd->reldesc));
+ inhOIDs = lappend_oid(inhOIDs, ptinfo->relid);
}
/* Concatenate the leaf partition OIDs. */
inhOIDs = list_concat(inhOIDs, leaf_part_oids);
-
- /*
- * Release the resources that RelationGetPartitionDispatchInfo
- * acquired for us but we don't really need in this case. Note that
- * we don't touch the root partitioned table itself by starting the
- * loop with 1, not 0.
- */
- for (i = 1; i < num_parted; i++)
- {
- PartitionDispatch pd = pds[i];
-
- heap_close(pd->reldesc, NoLock);
- ExecDropSingleTupleTableSlot(pd->tupslot);
- if (pd->tupmap)
- pfree(pd->tupmap);
- }
}
/* Scan the inheritance set and expand it */
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 2283c675e9..7b53baf847 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -39,36 +39,23 @@ typedef struct PartitionDescData
typedef struct PartitionDescData *PartitionDesc;
-/*-----------------------
- * PartitionDispatch - information about one partitioned table in a partition
- * hierarchy required to route a tuple to one of its partitions
- *
- * reldesc Relation descriptor of the table
- * key Partition key information of the table
- * keystate Execution state required for expressions in the partition key
- * partdesc Partition descriptor of the table
- * tupslot A standalone TupleTableSlot initialized with this table's tuple
- * descriptor
- * tupmap TupleConversionMap to convert from the parent's rowtype to
- * this table's rowtype (when extracting the partition key of a
- * tuple just before routing it through this table)
- * indexes Array with partdesc->nparts members (for details on what
- * individual members represent, see how they are set in
- * RelationGetPartitionDispatchInfo())
- *-----------------------
+typedef struct PartitionDispatchData *PartitionDispatch;
+
+/*
+ * Information about one partitioned table in a given partition tree
*/
-typedef struct PartitionDispatchData
+typedef struct PartitionedTableInfo
{
- Relation reldesc;
- PartitionKey key;
- List *keystate; /* list of ExprState */
- PartitionDesc partdesc;
- TupleTableSlot *tupslot;
- TupleConversionMap *tupmap;
- int *indexes;
-} PartitionDispatchData;
+ Oid relid;
+ Oid parentid;
-typedef struct PartitionDispatchData *PartitionDispatch;
+ /*
+ * This contains information about bounds of the partitions of this
+ * table and about where individual partitions are placed in the global
+ * partition tree.
+ */
+ PartitionDispatch pd;
+} PartitionedTableInfo;
extern void RelationBuildPartitionDesc(Relation relation);
extern bool partition_bounds_equal(int partnatts, int16 *parttyplen,
@@ -86,17 +73,18 @@ extern List *map_partition_varattnos(List *expr, int target_varno,
extern List *RelationGetPartitionQual(Relation rel);
extern Expr *get_partition_qual_relid(Oid relid);
+extern void RelationGetPartitionDispatchInfo(Relation rel,
+ List **ptinfos, List **leaf_part_oids);
+
/* For tuple routing */
-extern PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
- int *num_parted, List **leaf_part_oids);
-extern void FormPartitionKeyDatum(PartitionDispatch pd,
+extern void FormPartitionKeyDatum(PartitionKeyInfo *keyinfo,
TupleTableSlot *slot,
EState *estate,
Datum *values,
bool *isnull);
-extern int get_partition_for_tuple(PartitionDispatch *pd,
+extern int get_partition_for_tuple(PartitionTupleRoutingInfo **ptrinfos,
TupleTableSlot *slot,
EState *estate,
- PartitionDispatchData **failed_at,
+ PartitionTupleRoutingInfo **failed_at,
TupleTableSlot **failed_slot);
#endif /* PARTITION_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 60326f9d03..6e1d3a6d2f 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -208,13 +208,13 @@ extern void EvalPlanQualSetTuple(EPQState *epqstate, Index rti,
extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti);
extern void ExecSetupPartitionTupleRouting(Relation rel,
Index resultRTindex,
- PartitionDispatch **pd,
+ PartitionTupleRoutingInfo ***ptrinfos,
ResultRelInfo **partitions,
TupleConversionMap ***tup_conv_maps,
TupleTableSlot **partition_tuple_slot,
int *num_parted, int *num_partitions);
extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
- PartitionDispatch *pd,
+ PartitionTupleRoutingInfo **ptrinfos,
TupleTableSlot *slot,
EState *estate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 577499465d..07e50e0914 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -414,6 +414,55 @@ typedef struct ResultRelInfo
Relation ri_PartitionRoot;
} ResultRelInfo;
+/* Forward declarations, to avoid including other headers */
+typedef struct PartitionKeyData *PartitionKey;
+typedef struct PartitionDispatchData *PartitionDispatch;
+
+/*
+ * PartitionKeyInfoData - execution state for the partition key of a
+ * partitioned table
+ *
+ * keystate is the execution state required for expressions contained in the
+ * partition key. It is NIL until initialized by FormPartitionKeyDatum() if
+ * and when it is called; for example, during tuple routing through a given
+ * partitioned table.
+ */
+typedef struct PartitionKeyInfo
+{
+ PartitionKey key; /* Points into the table's relcache entry */
+ List *keystate;
+} PartitionKeyInfo;
+
+/*
+ * PartitionTupleRoutingInfo - information required for tuple-routing
+ * through one partitioned table in a partition
+ * tree
+ */
+typedef struct PartitionTupleRoutingInfo
+{
+ /* OID of the table */
+ Oid relid;
+
+ /* Information about the table's partitions */
+ PartitionDispatch pd;
+
+ /* See comment above the definition of PartitionKeyInfo */
+ PartitionKeyInfo *keyinfo;
+
+ /*
+ * A standalone TupleTableSlot initialized with this table's tuple
+ * descriptor
+ */
+ TupleTableSlot *tupslot;
+
+ /*
+ * TupleConversionMap to convert from the parent's rowtype to this table's
+ * rowtype (when extracting the partition key of a tuple just before
+ * routing it through this table)
+ */
+ TupleConversionMap *tupmap;
+} PartitionTupleRoutingInfo;
+
/* ----------------
* EState information
*
@@ -970,9 +1019,9 @@ typedef struct ModifyTableState
TupleTableSlot *mt_existing; /* slot to store existing target tuple in */
List *mt_excludedtlist; /* the excluded pseudo relation's tlist */
TupleTableSlot *mt_conflproj; /* CONFLICT ... SET ... projection target */
- struct PartitionDispatchData **mt_partition_dispatch_info;
/* Tuple-routing support info */
- int mt_num_dispatch; /* Number of entries in the above array */
+ struct PartitionTupleRoutingInfo **mt_ptrinfos;
+ int mt_num_parted; /* Number of entries in the above array */
int mt_num_partitions; /* Number of members in the following
* arrays */
ResultRelInfo *mt_partitions; /* Per partition result relation */
--
2.11.0
On 16 August 2017 at 11:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Attached updated patches.
Thanks Amit for the patches.
I too agree with the overall approach taken for keeping the locking
order consistent: it's best to do the locking with the existing
find_all_inheritors() since it is much cheaper than to lock them in
partition-bound order, the latter being expensive since it requires
opening partitioned tables.
I haven't yet done anything about changing the timing of opening and
locking leaf partitions, because it will require some more thinking about
the required planner changes. But the above set of patches will get us
far enough to get leaf partition sub-plans appear in the partition bound
order (same order as what partition tuple-routing uses in the executor).
So, I believe none of the changes done in pg_inherits.c are essential
for expanding the inheritance tree in bound order, right? I am not
sure whether we are planning to commit these two things together or
incrementally:
1. Expand in bound order
2. Allow for locking only the partitioned tables first.
For #1, the changes in pg_inherits.c are not required (viz, keeping
partitioned tables at the head of the list, adding inhchildparted
column, etc).
If we are going to do #2 together with #1, then in the patch set there
is no one using the capability introduced by #2. That is, there are no
callers of find_all_inheritors() that make use of the new
num_partitioned_children parameter. Also, there is no boolean
parameter for find_all_inheritors() to be used to lock only the
partitioned tables.
I feel we should think about
0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch later, and
first get the review done for the other patches.
-------
I see that RelationGetPartitionDispatchInfo() now returns quite a
small subset of what it used to return, which is good. But I feel for
expand_inherited_rtentry(), RelationGetPartitionDispatchInfo() is
still a bit heavy. We only require the oids, so the
PartitionedTableInfo data structure (including the pd->indexes array)
gets discarded.
Also, RelationGetPartitionDispatchInfo() has to call get_rel_relkind()
for each child. In expand_inherited_rtentry(), we anyway have to open
all the child tables, so we get the partition descriptors for each of
the children for free. So how about, in expand_inherited_rtentry(), we
traverse the partition tree using these descriptors similar to how it
is traversed in RelationGetPartitionDispatchInfo()? Maybe, to avoid
duplicating the traversal code, we can have a common API.
Still looking at RelationGetPartitionDispatchInfo() changes ...
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Aug 16, 2017 at 11:06 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
This patch series is blocking a bunch of other things, so it would be
nice if you could press forward with this quickly.
Attached updated patches.
Review for 0001. The attached patch has some minor changes to the
comments and code.
+ * All the relations in the partition tree (including 'rel') must have been
+ * locked (using at least the AccessShareLock) by the caller.
It would be good if we can Assert this in the function. But I couldn't find a
way to check whether the backend holds a lock of required strength. Is there
any?
/*
- * We locked all the partitions above including the leaf partitions.
- * Note that each of the relations in *partitions are eventually
- * closed by the caller.
+ * All the partitions were locked above. Note that the relcache
+ * entries will be closed by ExecEndModifyTable().
*/
I don't see much value in this hunk, so removed it in the attached patch.
+ list_free(all_parts);
ExecSetupPartitionTupleRouting() will be called only once per DML statement.
Leaking the memory for the duration of the DML statement is probably cheaper
than spending time traversing the list and freeing each cell independently.
So I removed the hunk in the attached set.
0002 review
+
+ <row>
+ <entry><structfield>inhchildparted</structfield></entry>
+ <entry><type>bool</type></entry>
+ <entry></entry>
+ <entry>
+ This is <literal>true</> if the child table is a partitioned table,
+ <literal>false</> otherwise
+ </entry>
+ </row>
In the catalogs we use the full word "partitioned", e.g. pg_partitioned_table.
Maybe we should name the column "inhchildpartitioned".
+#define OID_CMP(o1, o2) \
+ ((o1) < (o2) ? -1 : ((o1) > (o2) ? 1 : 0));
Instead of duplicating the logic in this macro and oid_cmp(), we may want to
call this macro in oid_cmp()? Or simply call oid_cmp() from inhchildinfo_cmp()
passing pointers to the OIDs; a pointer indirection would be costly, but still
maintainable.
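The first option above (one comparison macro shared by both comparators) could look roughly like the following. This is only a sketch under assumptions: Oid is a stand-in typedef, InhChildInfoSketch and both comparator names are hypothetical, and the "partitioned children first" ordering is taken from the description of the patch in this thread rather than from its code. Note that the macro as quoted ends in a semicolon, which would get in the way of using it inside an expression, so the sketch drops it.

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned int Oid;		/* stand-in for PostgreSQL's Oid */

/* Three-way comparison of two OIDs, usable inside any expression. */
#define OID_CMP(o1, o2) \
	((o1) < (o2) ? -1 : ((o1) > (o2) ? 1 : 0))

/* qsort()-style comparator over plain OIDs, built on the macro. */
static int
oid_cmp_sketch(const void *p1, const void *p2)
{
	assert(p1 != NULL && p2 != NULL);
	return OID_CMP(*(const Oid *) p1, *(const Oid *) p2);
}

/* Hypothetical per-child record: OID plus "is a partitioned table" flag. */
typedef struct InhChildInfoSketch
{
	Oid			relid;
	int			is_partitioned;
} InhChildInfoSketch;

/*
 * qsort()-style comparator: partitioned children sort ahead of the others,
 * ties are broken by ascending OID.  It reuses OID_CMP directly, so no
 * pointer indirection through oid_cmp_sketch() is needed.
 */
static int
inhchildinfo_cmp_sketch(const void *p1, const void *p2)
{
	const InhChildInfoSketch *c1 = p1;
	const InhChildInfoSketch *c2 = p2;

	assert(c1 != NULL && c2 != NULL);
	if (c1->is_partitioned != c2->is_partitioned)
		return c1->is_partitioned ? -1 : 1;
	return OID_CMP(c1->relid, c2->relid);
}
```

Both comparators then share a single definition of the ordering, which is the maintainability point being raised, without paying for the extra indirection.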
+ if (num_partitioned_children)
+ *num_partitioned_children = my_num_partitioned_children;
+
Instead of this conditional, why not make every caller pass a pointer to an
integer? The callers will just ignore the value if they don't want it. If we do
this change, we can get rid of my_num_partitioned_children variable and
directly update the passed in pointer variable.
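To make the two alternatives concrete, here is a small sketch of both calling conventions. Everything in it is hypothetical (ChildSketch and both function names are made up for illustration and are not the pg_inherits.c code); it only shows the trade-off between an optional output pointer and a mandatory one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical child record; only the flag matters here. */
typedef struct ChildSketch
{
	unsigned int relid;
	int			is_partitioned;
} ChildSketch;

/*
 * Variant A: the output parameter is optional.  Callers that do not care
 * pass NULL; the count is kept in a local and stored once at the end.
 */
static int
scan_children_optional(const ChildSketch *children, int nchildren,
					   int *num_partitioned)
{
	int			my_num_partitioned = 0;
	int			i;

	assert(children != NULL || nchildren == 0);
	for (i = 0; i < nchildren; i++)
	{
		if (children[i].is_partitioned)
			my_num_partitioned++;
	}
	if (num_partitioned)
		*num_partitioned = my_num_partitioned;
	return nchildren;
}

/*
 * Variant B: the output parameter is mandatory.  No local counter and no
 * conditional, but every caller has to declare a variable for it.
 */
static int
scan_children_mandatory(const ChildSketch *children, int nchildren,
						int *num_partitioned)
{
	int			i;

	assert(num_partitioned != NULL);
	*num_partitioned = 0;
	for (i = 0; i < nchildren; i++)
	{
		if (children[i].is_partitioned)
			(*num_partitioned)++;
	}
	return nchildren;
}
```

Variant A costs one NULL test per call as long as the store happens after the loop; variant B removes the test at the price of a dummy variable at every call site that ignores the count.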
inhrelid = ((Form_pg_inherits) GETSTRUCT(inheritsTuple))->inhrelid;
- if (numoids >= maxoids)
+ is_partitioned = ((Form_pg_inherits)
+ GETSTRUCT(inheritsTuple))->inhchildparted;
Now that we are fetching two members from the Form_pg_inherits structure, maybe
we should use a local variable
Form_pg_inherits inherits_tuple = (Form_pg_inherits) GETSTRUCT(inheritsTuple);
and use that to fetch its members.
I am still reviewing changes in this patch.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
0001-Relieve-RelationGetPartitionDispatchInfo-of-doing-an.patch
From e9ad9f947d6d553ce8ec29feb8560dff48f166b6 Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Wed, 16 Aug 2017 11:36:14 +0900
Subject: [PATCH 1/3] Relieve RelationGetPartitionDispatchInfo() of doing any
locking
Anyone who wants to call RelationGetPartitionDispatchInfo() must first
acquire locks using find_all_inheritors.
Doing it this way gets rid of the possibility of a deadlock when partitions
are concurrently locked, because RelationGetPartitionDispatchInfo would lock
the partitions in one order and find_all_inheritors would in another.
Reported-by: Amit Khandekar, Robert Haas
Reports: https://postgr.es/m/CAJ3gD9fdjk2O8aPMXidCeYeB-mFB%3DwY9ZLfe8cQOfG4bTqVGyQ%40mail.gmail.com
https://postgr.es/m/CA%2BTgmobwbh12OJerqAGyPEjb_%2B2y7T0nqRKTcjed6L4NTET6Fg%40mail.gmail.com
---
src/backend/catalog/partition.c | 52 ++++++++++++++++++---------------------
src/backend/executor/execMain.c | 7 +++---
src/include/catalog/partition.h | 3 +--
3 files changed, 29 insertions(+), 33 deletions(-)
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index c1a307c..e5dc42d 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -998,14 +998,22 @@ get_partition_qual_relid(Oid relid)
/*
* RelationGetPartitionDispatchInfo
* Returns information necessary to route tuples down a partition tree
+ * rooted at "rel" as an array of PartitionDispatch entries.
*
- * All the partitions will be locked with lockmode, unless it is NoLock.
- * A list of the OIDs of all the leaf partitions of rel is returned in
- * *leaf_part_oids.
+ * The array contains as many entries as the number of partitioned tables in
+ * the partition tree. The number of entries is returned in "num_parted". The
+ * function also returns a list of the OIDs of all the leaf partitions of rel
+ * in "leaf_part_oids".
+ *
+ * The function traverses the partition tree using relcaches of partitioned
+ * tables within it. Hence all the relations in the partition tree including
+ * the root must have been locked (with at least AccessShareLock) by the caller
+ * typically using find_all_inheritors() to preserve the locking order to avoid
+ * deadlocks.
*/
PartitionDispatch *
-RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
- int *num_parted, List **leaf_part_oids)
+RelationGetPartitionDispatchInfo(Relation rel, int *num_parted,
+ List **leaf_part_oids)
{
PartitionDispatchData **pd;
List *all_parts = NIL,
@@ -1019,14 +1027,12 @@ RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
offset;
/*
- * Lock partitions and make a list of the partitioned ones to prepare
- * their PartitionDispatch objects below.
- *
- * Cannot use find_all_inheritors() here, because then the order of OIDs
- * in parted_rels list would be unknown, which does not help, because we
- * assign indexes within individual PartitionDispatch in an order that is
- * predetermined (determined by the order of OIDs in individual partition
- * descriptors).
+ * For every partitioned table in the tree, starting with the root
+ * partitioned table, add its relcache entry to parted_rels, while also
+ * queuing its partitions (in the order in which they appear in the
+ * partition descriptor) to be looked at later in the same loop. This is
+ * a bit tricky but works because the foreach() macro doesn't fetch the
+ * next list element until the bottom of the loop.
*/
*num_parted = 1;
parted_rels = list_make1(rel);
@@ -1035,29 +1041,19 @@ RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
APPEND_REL_PARTITION_OIDS(rel, all_parts, all_parents);
forboth(lc1, all_parts, lc2, all_parents)
{
- Relation partrel = heap_open(lfirst_oid(lc1), lockmode);
+ Oid partrelid = lfirst_oid(lc1);
Relation parent = lfirst(lc2);
- PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
- /*
- * If this partition is a partitioned table, add its children to the
- * end of the list, so that they are processed as well.
- */
- if (partdesc)
+ if (get_rel_relkind(partrelid) == RELKIND_PARTITIONED_TABLE)
{
+ /* Already locked by the caller. */
+ Relation partrel = heap_open(partrelid, NoLock);
+
(*num_parted)++;
parted_rels = lappend(parted_rels, partrel);
parted_rel_parents = lappend(parted_rel_parents, parent);
APPEND_REL_PARTITION_OIDS(partrel, all_parts, all_parents);
}
- else
- heap_close(partrel, NoLock);
-
- /*
- * We keep the partitioned ones open until we're done using the
- * information being collected here (for example, see
- * ExecEndModifyTable).
- */
}
/*
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 6671a25..91a3766 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -43,6 +43,7 @@
#include "access/xact.h"
#include "catalog/namespace.h"
#include "catalog/partition.h"
+#include "catalog/pg_inherits_fn.h"
#include "catalog/pg_publication.h"
#include "commands/matview.h"
#include "commands/trigger.h"
@@ -3249,9 +3250,9 @@ ExecSetupPartitionTupleRouting(Relation rel,
int i;
ResultRelInfo *leaf_part_rri;
- /* Get the tuple-routing information and lock partitions */
- *pd = RelationGetPartitionDispatchInfo(rel, RowExclusiveLock, num_parted,
- &leaf_parts);
+ /* Get the tuple-routing information after locking all the partitions */
+ find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
+ *pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
*num_partitions = list_length(leaf_parts);
*partitions = (ResultRelInfo *) palloc(*num_partitions *
sizeof(ResultRelInfo));
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index bef7a0f..2283c67 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -88,8 +88,7 @@ extern Expr *get_partition_qual_relid(Oid relid);
/* For tuple routing */
extern PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
- int lockmode, int *num_parted,
- List **leaf_part_oids);
+ int *num_parted, List **leaf_part_oids);
extern void FormPartitionKeyDatum(PartitionDispatch pd,
TupleTableSlot *slot,
EState *estate,
--
1.7.9.5
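The loop comment in the patch above describes walking the partition tree while queuing each partitioned table's partitions, in partition descriptor order, on the very list being iterated. A simplified model of that traversal is sketched below. It is not the patch's code: PartNodeSketch, MAX_PARTS, MAX_NODES and collect_in_bound_order() are hypothetical, and an explicit head/tail queue stands in for appending to a List inside foreach().

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PARTS 4				/* hypothetical fan-out limit */
#define MAX_NODES 64			/* hypothetical queue capacity */

/* Hypothetical tree node: a table and its partitions in bound order. */
typedef struct PartNodeSketch
{
	unsigned int relid;
	int			nparts;			/* 0 means leaf partition */
	const struct PartNodeSketch *parts[MAX_PARTS];
} PartNodeSketch;

/*
 * Breadth-first walk from "root".  OIDs of partitioned tables are written
 * to parted[] and OIDs of leaf partitions to leaves[], both in the order
 * in which the walk reaches them.  The output arrays must be large enough.
 */
static void
collect_in_bound_order(const PartNodeSketch *root,
					   unsigned int *parted, int *num_parted,
					   unsigned int *leaves, int *num_leaves)
{
	const PartNodeSketch *queue[MAX_NODES];
	int			head = 0;
	int			tail = 0;

	*num_parted = 0;
	*num_leaves = 0;
	queue[tail++] = root;

	while (head < tail)
	{
		const PartNodeSketch *node = queue[head++];
		int			i;

		if (node->nparts > 0)
		{
			parted[(*num_parted)++] = node->relid;
			/* Queue the partitions; they are visited later in this loop. */
			for (i = 0; i < node->nparts; i++)
			{
				assert(tail < MAX_NODES);
				queue[tail++] = node->parts[i];
			}
		}
		else
			leaves[(*num_leaves)++] = node->relid;
	}
}
```

No locks are taken anywhere in the walk, which mirrors the point of the patch: the caller is expected to have locked every relation beforehand.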
Hi Amit,
Thanks for the comments.
On 2017/08/16 20:30, Amit Khandekar wrote:
On 16 August 2017 at 11:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Attached updated patches.
Thanks Amit for the patches.
I too agree with the overall approach taken for keeping the locking
order consistent: it's best to do the locking with the existing
find_all_inheritors() since it is much cheaper than to lock them in
partition-bound order, the latter being expensive since it requires
opening partitioned tables.
Yeah. Per Robert's design idea, we will always do the *locking* in
the order determined by find_all_inheritors/find_inheritance_children.
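As a rough illustration of keeping the two orders separate, the sketch below derives a lock order from a list of OIDs given in bound order. It assumes the canonical locking order can be modelled as ascending OID, which is how the thread describes the output of find_inheritance_children(); the function names are hypothetical and no real locking is performed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned int Oid;		/* stand-in for PostgreSQL's Oid */

/* qsort() comparator: ascending OID. */
static int
oid_cmp_asc(const void *a, const void *b)
{
	Oid			x = *(const Oid *) a;
	Oid			y = *(const Oid *) b;

	return (x < y) ? -1 : ((x > y) ? 1 : 0);
}

/*
 * Given n OIDs in partition bound order, fill lock_order[] with the same
 * OIDs in the order in which they should be locked.  bound_order[] is left
 * untouched, so later processing can still follow bound order.
 */
static void
compute_lock_order(const Oid *bound_order, int n, Oid *lock_order)
{
	assert(n >= 0);
	if (n == 0)
		return;
	memcpy(lock_order, bound_order, (size_t) n * sizeof(Oid));
	qsort(lock_order, (size_t) n, sizeof(Oid), oid_cmp_asc);
}
```

Every backend that locks the same set of tables through lock_order[] acquires them in the same sequence, so two of them cannot each hold a lock the other is waiting for on account of ordering alone.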
I haven't yet done anything about changing the timing of opening and
locking leaf partitions, because it will require some more thinking about
the required planner changes. But the above set of patches will get us
far enough to get leaf partition sub-plans appear in the partition bound
order (same order as what partition tuple-routing uses in the executor).
So, I believe none of the changes done in pg_inherits.c are essential
for expanding the inheritance tree in bound order, right?
Right.
The changes to pg_inherits.c are only about recognizing partitioned tables
in an inheritance hierarchy and putting them ahead in the returned list.
Now that I think of it, the title of the patch (teach pg_inherits.c about
"partitioning") sounds a bit confusing. In particular, the patch does not
teach it things like partition bound order, just that some tables in the
hierarchy could be partitioned tables.
I am not
sure whether we are planning to commit these two things together or
incrementally:
1. Expand in bound order
2. Allow for locking only the partitioned tables first.
For #1, the changes in pg_inherits.c are not required (viz, keeping
partitioned tables at the head of the list, adding inhchildparted
column, etc).
Yes. Changes to pg_inherits.c can be committed completely independently
of either 1 or 2, although then there would be nobody using that capability.
About 2: I think the capability to lock only the partitioned tables in
expand_inherited_rtentry() will only be used once we have the patch to do
the necessary planner restructuring that will allow us to defer child
table locking to some place that is not expand_inherited_rtentry(). I am
working on that patch now and should be able to show something soon.
If we are going to do #2 together with #1, then in the patch set there
is no one using the capability introduced by #2. That is, there are no
callers of find_all_inheritors() that make use of the new
num_partitioned_children parameter. Also, there is no boolean
parameter for find_all_inheritors() to be used to lock only the
partitioned tables.
I feel we should think about
0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch later, and
first get the review done for the other patches.
I think that makes sense.
I see that RelationGetPartitionDispatchInfo() now returns quite a
small subset of what it used to return, which is good. But I feel for
expand_inherited_rtentry(), RelationGetPartitionDispatchInfo() is
still a bit heavy. We only require the oids, so the
PartitionedTableInfo data structure (including the pd->indexes array)
gets discarded.
Maybe we could make the output argument optional, but I don't see much
point in being too conservative here. Building the indexes array does not
cost us that much and if a not-too-distant-in-future patch could use that
information somehow, it could do so for free.
Also, RelationGetPartitionDispatchInfo() has to call get_rel_relkind()
for each child. In expand_inherited_rtentry(), we anyway have to open
all the child tables, so we get the partition descriptors for each of
the children for free. So how about, in expand_inherited_rtentry(), we
traverse the partition tree using these descriptors similar to how it
is traversed in RelationGetPartitionDispatchInfo() ? May be to avoid
code duplication for traversing, we can have a common API.
As mentioned, one goal I'm seeking is to avoid having to open the child
tables in expand_inherited_rtentry().
Thanks,
Amit
Hi Ashutosh,
Thanks for the review and the updated patch.
On 2017/08/16 21:48, Ashutosh Bapat wrote:
On Wed, Aug 16, 2017 at 11:06 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
This patch series is blocking a bunch of other things, so it would be
nice if you could press forward with this quickly.
Attached updated patches.
Review for 0001. The attached patch has some minor changes to the
comments and code.
+ * All the relations in the partition tree (including 'rel') must have been
+ * locked (using at least the AccessShareLock) by the caller.
It would be good if we can Assert this in the function. But I couldn't find a
way to check whether the backend holds a lock of required strength. Is there
any?
Currently there isn't. Robert suggested a RelationLockHeldByMe(Oid) [1],
which is still a TODO on my plate.
/*
- * We locked all the partitions above including the leaf partitions.
- * Note that each of the relations in *partitions are eventually
- * closed by the caller.
+ * All the partitions were locked above. Note that the relcache
+ * entries will be closed by ExecEndModifyTable().
*/
I don't see much value in this hunk,
I thought the new text was a bit clearer, but maybe that's just me. Will
remove.
+ list_free(all_parts);
ExecSetupPartitionTupleRouting() will be called only once per DML statement.
Leaking the memory for the duration of the DML statement is probably cheaper
than spending time traversing the list and freeing each cell independently.
Fair enough, will remove the list_free().
0002 review
+
+ <row>
+ <entry><structfield>inhchildparted</structfield></entry>
+ <entry><type>bool</type></entry>
+ <entry></entry>
+ <entry>
+ This is <literal>true</> if the child table is a partitioned table,
+ <literal>false</> otherwise
+ </entry>
+ </row>
In the catalogs we are using full "partitioned" e.g. pg_partitioned_table. May
be we should name the column as "inhchildpartitioned".
Sure.
+#define OID_CMP(o1, o2) \
+ ((o1) < (o2) ? -1 : ((o1) > (o2) ? 1 : 0));
Instead of duplicating the logic in this macro and oid_cmp(), we may want to
call this macro in oid_cmp()? Or simply call oid_cmp() from inhchildinfo_cmp()
passing pointers to the OIDs; a pointer indirection would be costly, but still
maintainable.
Actually, I avoided using oid_cmp exactly for that reason.
+ if (num_partitioned_children)
+ *num_partitioned_children = my_num_partitioned_children;
+
Instead of this conditional, why not to make every caller pass a pointer to
integer. The callers will just ignore the value if they don't want it. If we do
this change, we can get rid of my_num_partitioned_children variable and
directly update the passed in pointer variable.
There are a bunch of callers of find_all_inheritors() and
find_inheritance_children. Changes to make them all declare a pointless
variable seemed off to me. The conditional in question doesn't seem to be
that expensive. (To be fair, the one introduced in find_all_inheritors()
kind of is as implemented by the patch, because it's executed for every
iteration of the foreach(l, rels_list) loop, which I will fix.)
inhrelid = ((Form_pg_inherits) GETSTRUCT(inheritsTuple))->inhrelid;
- if (numoids >= maxoids)
+ is_partitioned = ((Form_pg_inherits)
+ GETSTRUCT(inheritsTuple))->inhchildparted;
Now that we are fetching two members from Form_pg_inherits structure, may be we
should use a local variable
Form_pg_inherits inherits_tuple = GETSTRUCT(inheritsTuple);
and use that to fetch its members.
Sure, will do.
I am still reviewing changes in this patch.
Okay, will wait for more comments before sending the updated patches.
Thanks,
Amit
[1]: /messages/by-id/CA+Tgmobwbh12OJerqAGyPEjb_+2y7T0nqRKTcjed6L4NTET6Fg@mail.gmail.com
On Wed, Aug 16, 2017 at 10:12 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
In the catalogs we are using full "partitioned" e.g. pg_partitioned_table. May
be we should name the column as "inhchildpartitioned".
Sure.
I suggest inhpartitioned or inhispartition. inhchildpartitioned seems too long.
There are a bunch of callers of find_all_inheritors() and
find_inheritance_children. Changes to make them all declare a pointless
variable seemed off to me. The conditional in question doesn't seem to be
that expensive.
+1.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017/08/17 11:22, Robert Haas wrote:
On Wed, Aug 16, 2017 at 10:12 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
In the catalogs we are using full "partitioned" e.g. pg_partitioned_table. May
be we should name the column as "inhchildpartitioned".
Sure.
I suggest inhpartitioned or inhispartition. inhchildpartitioned seems too long.
inhchildpartitioned indeed seems long.
Since we are storing whether the child table (one with the OID inhrelid) is
partitioned, inhpartitioned seems best to me. Will implement that.
Thanks,
Amit
On Thu, Aug 17, 2017 at 8:06 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2017/08/17 11:22, Robert Haas wrote:
On Wed, Aug 16, 2017 at 10:12 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
In the catalogs we are using full "partitioned" e.g. pg_partitioned_table. May
be we should name the column as "inhchildpartitioned".
Sure.
I suggest inhpartitioned or inhispartition. inhchildpartitioned seems too long.
inhchildpartitioned indeed seems long.
Since we are storing whether the child table (one with the OID inhrelid) is
partitioned, inhpartitioned seems best to me. Will implement that.
inhchildpartitioned is long but clearly tells that the child table is
partitioned, not the parent. pg_inherits can have parents which are not
partitioned, so it's better to have a self-explanatory name for the column. I
am fine with some other name as long as it's clear.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 2017/08/17 13:56, Ashutosh Bapat wrote:
On Thu, Aug 17, 2017 at 8:06 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2017/08/17 11:22, Robert Haas wrote:
On Wed, Aug 16, 2017 at 10:12 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
In the catalogs we are using full "partitioned" e.g. pg_partitioned_table. May
be we should name the column as "inhchildpartitioned".
Sure.
I suggest inhpartitioned or inhispartition. inhchildpartitioned seems too long.
inhchildpartitioned indeed seems long.
Since we are storing whether the child table (one with the OID inhrelid) is
partitioned, inhpartitioned seems best to me. Will implement that.
inhchildpartitioned is long but clearly tells that the child table is
partitioned, not the parent. pg_inherits can have parents which are not
partitioned, so it's better to have self-explanatory catalog name. I
am fine with some other name as long as it's clear.
OTOH, the pg_inherits field that stores the OID of the child table does
not mention "child" in its name (inhrelid), although you are right that
inhpartitioned can be taken to mean that the inheritance parent
(inhparent) is partitioned. In any case, system catalog documentation
which clearly states what's what might be the best guide for the confused.
Of course, we can add a comment in pg_inherits.h next to the field
explaining what it is for those reading the source code and confused about
what inhpartitioned means.
Thanks,
Amit
On Thu, Aug 17, 2017 at 10:54 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2017/08/17 13:56, Ashutosh Bapat wrote:
On Thu, Aug 17, 2017 at 8:06 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2017/08/17 11:22, Robert Haas wrote:
On Wed, Aug 16, 2017 at 10:12 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
In the catalogs we are using full "partitioned" e.g. pg_partitioned_table. May
be we should name the column as "inhchildpartitioned".
Sure.
I suggest inhpartitioned or inhispartition. inhchildpartitioned seems too long.
inhchildpartitioned indeed seems long.
Since we are storing whether the child table (one with the OID inhrelid) is
partitioned, inhpartitioned seems best to me. Will implement that.
inhchildpartitioned is long but clearly tells that the child table is
partitioned, not the parent. pg_inherits can have parents which are not
partitioned, so it's better to have self-explanatory catalog name. I
am fine with some other name as long as it's clear.
OTOH, the pg_inherits field that stores the OID of the child table does
not mention "child" in its name (inhrelid), although you are right that
inhpartitioned can be taken to mean that the inheritance parent
(inhparent) is partitioned. In any case, system catalog documentation
which clearly states what's what might be the best guide for the confused.
Sorry, I overlooked this detail. To me it means that the table is
driven by the child and inhpartitioned looks good then.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 2017/08/17 10:09, Amit Langote wrote:
On 2017/08/16 20:30, Amit Khandekar wrote:
On 16 August 2017 at 11:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
I am not
sure whether we are planning to commit these two things together or
incrementally:
1. Expand in bound order
2. Allow for locking only the partitioned tables first.
For #1, the changes in pg_inherits.c are not required (viz, keeping
partitioned tables at the head of the list, adding inhchildparted
column, etc).
Yes. Changes to pg_inherits.c can be committed completely independently
of either 1 or 2, although then there would be nobody using that capability.
About 2: I think the capability to lock only the partitioned tables in
expand_inherited_rtentry() will only be used once we have the patch to do
the necessary planner restructuring that will allow us to defer child
table locking to some place that is not expand_inherited_rtentry(). I am
working on that patch now and should be able to show something soon.
If we are going to do #2 together with #1, then in the patch set there
is no one using the capability introduced by #2. That is, there are no
callers of find_all_inheritors() that make use of the new
num_partitioned_children parameter. Also, there is no boolean
parameter for find_all_inheritors() to be used to lock only the
partitioned tables.
I feel we should think about
0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch later, and
first get the review done for the other patches.
I think that makes sense.
After thinking some more on this, I think Amit's suggestion to put this
patch at the end of the priority list is good (that is, the patch that
teaches pg_inherits infrastructure to list partitioned tables ahead in the
list.) Its purpose is mainly to fulfill the requirement that partitioned
tables be able to be locked ahead of any leaf partitions in the list (per
the design idea Robert suggested [1]). The patch which requires that
capability is not in the picture yet. Perhaps, we could review and commit
this patch only when that future patch shows up. So, I will hold that
patch for now.
Thoughts?
Attached rest of the patches. 0001 has changes per Ashutosh's review
comments [2]:
0001: Relieve RelationGetPartitionDispatchInfo() of doing any locking
0002: Teach expand_inherited_rtentry to use partition bound order
0003: Decouple RelationGetPartitionDispatchInfo() from executor
Thanks,
Amit
[1]: /messages/by-id/CA+Tgmobwbh12OJerqAGyPEjb_+2y7T0nqRKTcjed6L4NTET6Fg@mail.gmail.com
[2]: /messages/by-id/CAFjFpRdXn7w7nVKv4qN9fa+tzRi_mJFNCsBWy=bd0SLbYczUfA@mail.gmail.com
Attachments:
0001-Relieve-RelationGetPartitionDispatchInfo-of-doing-an.patch
From 365409b9d7cf723a65b832804bc5002d83ae15d5 Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Wed, 16 Aug 2017 11:36:14 +0900
Subject: [PATCH 1/3] Relieve RelationGetPartitionDispatchInfo() of doing any
locking
Anyone who wants to call RelationGetPartitionDispatchInfo() must first
acquire locks using find_all_inheritors.
Doing it this way gets rid of the possibility of a deadlock when partitions
are concurrently locked, because RelationGetPartitionDispatchInfo would lock
the partitions in one order and find_all_inheritors would lock them in another.
Reported-by: Amit Khandekar, Robert Haas
Reports: https://postgr.es/m/CAJ3gD9fdjk2O8aPMXidCeYeB-mFB%3DwY9ZLfe8cQOfG4bTqVGyQ%40mail.gmail.com
https://postgr.es/m/CA%2BTgmobwbh12OJerqAGyPEjb_%2B2y7T0nqRKTcjed6L4NTET6Fg%40mail.gmail.com
---
src/backend/catalog/partition.c | 55 ++++++++++++++++++++++-------------------
src/backend/executor/execMain.c | 10 +++++---
src/include/catalog/partition.h | 3 +--
3 files changed, 37 insertions(+), 31 deletions(-)
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index c1a307c8d3..96a64ce6b2 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -999,12 +999,16 @@ get_partition_qual_relid(Oid relid)
* RelationGetPartitionDispatchInfo
* Returns information necessary to route tuples down a partition tree
*
- * All the partitions will be locked with lockmode, unless it is NoLock.
- * A list of the OIDs of all the leaf partitions of rel is returned in
- * *leaf_part_oids.
+ * The number of elements in the returned array (that is, the number of
+ * PartitionDispatch objects for the partitioned tables in the partition tree)
+ * is returned in *num_parted and a list of the OIDs of all the leaf
+ * partitions of rel is returned in *leaf_part_oids.
+ *
+ * All the relations in the partition tree (including 'rel') must have been
+ * locked (using at least the AccessShareLock) by the caller.
*/
PartitionDispatch *
-RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
+RelationGetPartitionDispatchInfo(Relation rel,
int *num_parted, List **leaf_part_oids)
{
PartitionDispatchData **pd;
@@ -1019,14 +1023,18 @@ RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
offset;
/*
- * Lock partitions and make a list of the partitioned ones to prepare
- * their PartitionDispatch objects below.
+ * We rely on the relcache to traverse the partition tree to build both
+ * the leaf partition OIDs list and the array of PartitionDispatch objects
+ * for the partitioned tables in the tree. That means every partitioned
+ * table in the tree must be locked, which is fine since we require the
+ * caller to lock all the partitions anyway.
*
- * Cannot use find_all_inheritors() here, because then the order of OIDs
- * in parted_rels list would be unknown, which does not help, because we
- * assign indexes within individual PartitionDispatch in an order that is
- * predetermined (determined by the order of OIDs in individual partition
- * descriptors).
+ * For every partitioned table in the tree, starting with the root
+ * partitioned table, add its relcache entry to parted_rels, while also
+ * queuing its partitions (in the order in which they appear in the
+ * partition descriptor) to be looked at later in the same loop. This is
+ * a bit tricky but works because the foreach() macro doesn't fetch the
+ * next list element until the bottom of the loop.
*/
*num_parted = 1;
parted_rels = list_make1(rel);
@@ -1035,29 +1043,24 @@ RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
APPEND_REL_PARTITION_OIDS(rel, all_parts, all_parents);
forboth(lc1, all_parts, lc2, all_parents)
{
- Relation partrel = heap_open(lfirst_oid(lc1), lockmode);
+ Oid partrelid = lfirst_oid(lc1);
Relation parent = lfirst(lc2);
- PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
- /*
- * If this partition is a partitioned table, add its children to the
- * end of the list, so that they are processed as well.
- */
- if (partdesc)
+ if (get_rel_relkind(partrelid) == RELKIND_PARTITIONED_TABLE)
{
+ /*
+ * Already locked by the caller. Note that it is the
+ * responsibility of the caller to close the below relcache entry,
+ * once done using the information being collected here (for
+ * example, in ExecEndModifyTable).
+ */
+ Relation partrel = heap_open(partrelid, NoLock);
+
(*num_parted)++;
parted_rels = lappend(parted_rels, partrel);
parted_rel_parents = lappend(parted_rel_parents, parent);
APPEND_REL_PARTITION_OIDS(partrel, all_parts, all_parents);
}
- else
- heap_close(partrel, NoLock);
-
- /*
- * We keep the partitioned ones open until we're done using the
- * information being collected here (for example, see
- * ExecEndModifyTable).
- */
}
/*
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 6671a25ffb..74071eede6 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -43,6 +43,7 @@
#include "access/xact.h"
#include "catalog/namespace.h"
#include "catalog/partition.h"
+#include "catalog/pg_inherits_fn.h"
#include "catalog/pg_publication.h"
#include "commands/matview.h"
#include "commands/trigger.h"
@@ -3249,9 +3250,12 @@ ExecSetupPartitionTupleRouting(Relation rel,
int i;
ResultRelInfo *leaf_part_rri;
- /* Get the tuple-routing information and lock partitions */
- *pd = RelationGetPartitionDispatchInfo(rel, RowExclusiveLock, num_parted,
- &leaf_parts);
+ /*
+ * Get the information about the partition tree after locking all the
+ * partitions.
+ */
+ (void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
+ *pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
*num_partitions = list_length(leaf_parts);
*partitions = (ResultRelInfo *) palloc(*num_partitions *
sizeof(ResultRelInfo));
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index bef7a0f5fb..2283c675e9 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -88,8 +88,7 @@ extern Expr *get_partition_qual_relid(Oid relid);
/* For tuple routing */
extern PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
- int lockmode, int *num_parted,
- List **leaf_part_oids);
+ int *num_parted, List **leaf_part_oids);
extern void FormPartitionKeyDatum(PartitionDispatch pd,
TupleTableSlot *slot,
EState *estate,
--
2.11.0
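
The point of 0001 can be illustrated with a small standalone sketch. It is not PostgreSQL code: toy_lock_order() and oid_cmp() are hypothetical names, and plain ascending OID order is used as a simplifying stand-in for whatever canonical order find_all_inheritors() yields. The idea shown is that two callers that discovered the same partitions in different orders still take their locks in one common order, so neither can end up waiting on a lock the other holds while holding one the other wants.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for unsigned OID-like values, ascending. */
static int
oid_cmp(const void *a, const void *b)
{
    unsigned int x = *(const unsigned int *) a;
    unsigned int y = *(const unsigned int *) b;

    return (x > y) - (x < y);
}

/*
 * Copy 'oids' into 'lock_order' sorted ascending.  In this toy model, this
 * is the single order in which every backend would acquire its locks,
 * regardless of the order in which it discovered the partitions.
 */
static void
toy_lock_order(const unsigned int *oids, size_t n, unsigned int *lock_order)
{
    memcpy(lock_order, oids, n * sizeof(unsigned int));
    qsort(lock_order, n, sizeof(unsigned int), oid_cmp);
}
```

With locking done up front in that common order, RelationGetPartitionDispatchInfo() is free to walk the tree in partition bound order without taking any locks itself.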
0002-Teach-expand_inherited_rtentry-to-use-partition-boun.patch (text/plain; charset=UTF-8)
From 48f066bcf49cad67ce141cf490e9ae98c3f98568 Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Wed, 9 Aug 2017 15:52:36 +0900
Subject: [PATCH 2/3] Teach expand_inherited_rtentry to use partition bound
order
After locking the child tables using find_all_inheritors, we discard
the list of child table OIDs that it generates and rebuild the same
using the information returned by RelationGetPartitionDispatchInfo.
---
src/backend/optimizer/prep/prepunion.c | 51 ++++++++++++++++++++++++++++++++++
1 file changed, 51 insertions(+)
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 6d8f8938b2..e730c24ee4 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -33,6 +33,7 @@
#include "access/heapam.h"
#include "access/htup_details.h"
#include "access/sysattr.h"
+#include "catalog/partition.h"
#include "catalog/pg_inherits_fn.h"
#include "catalog/pg_type.h"
#include "miscadmin.h"
@@ -1452,6 +1453,56 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
*/
oldrelation = heap_open(parentOID, NoLock);
+ /*
+ * For partitioned tables, we arrange the child table OIDs such that they
+ * appear in the partition bound order.
+ */
+ if (rte->relkind == RELKIND_PARTITIONED_TABLE)
+ {
+ List *leaf_part_oids;
+ int num_parted,
+ i;
+ PartitionDispatch *pds;
+
+ /* Discard the original list. */
+ list_free(inhOIDs);
+ inhOIDs = NIL;
+
+ /* Request partitioning information. */
+ pds = RelationGetPartitionDispatchInfo(oldrelation, &num_parted,
+ &leaf_part_oids);
+
+ /*
+ * First collect the partitioned child table OIDs, which includes the
+ * root parent at the head.
+ */
+ for (i = 0; i < num_parted; i++)
+ {
+ PartitionDispatch pd = pds[i];
+
+ inhOIDs = lappend_oid(inhOIDs, RelationGetRelid(pd->reldesc));
+ }
+
+ /* Concatenate the leaf partition OIDs. */
+ inhOIDs = list_concat(inhOIDs, leaf_part_oids);
+
+ /*
+ * Release the resources that RelationGetPartitionDispatchInfo
+ * acquired for us but we don't really need in this case. Note that
+ * we don't touch the root partitioned table itself by starting the
+ * loop with 1, not 0.
+ */
+ for (i = 1; i < num_parted; i++)
+ {
+ PartitionDispatch pd = pds[i];
+
+ heap_close(pd->reldesc, NoLock);
+ ExecDropSingleTupleTableSlot(pd->tupslot);
+ if (pd->tupmap)
+ pfree(pd->tupmap);
+ }
+ }
+
/* Scan the inheritance set and expand it */
appinfos = NIL;
has_child = false;
--
2.11.0
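
Since both patches above lean on the "indexes" array that RelationGetPartitionDispatchInfo() builds, here is a toy sketch of how that encoding is consumed during tuple routing. It is an illustration under assumptions, not the executor code: ToyDispatch and toy_route() are hypothetical, and the partition chosen at each level is passed in by the caller instead of being found by a search over the partition bounds. A value >= 0 is an index into the leaf partition array; a negative value, negated, is the index of the next partitioned table to descend into.

```c
#include <assert.h>

/* Hypothetical, stripped-down stand-in for PartitionDispatchData. */
typedef struct ToyDispatch
{
    int        nparts;      /* number of partitions of this table */
    const int *indexes;     /* >= 0: leaf index; < 0: -(index into pds[]) */
} ToyDispatch;

/*
 * Walk down from the root (pds[0]).  choices[level] says which partition
 * of the current table the tuple falls into at that level; in the real
 * code that comes from searching the partition bounds.  Returns the index
 * of the leaf partition that is finally reached.
 */
static int
toy_route(const ToyDispatch *pds, const int *choices)
{
    const ToyDispatch *parent = &pds[0];
    int         level = 0;

    for (;;)
    {
        int         cur = choices[level++];
        int         idx;

        assert(cur >= 0 && cur < parent->nparts);
        idx = parent->indexes[cur];
        if (idx >= 0)
            return idx;         /* found a leaf partition */
        parent = &pds[-idx];    /* descend into a partitioned child */
    }
}
```

For a root with partitions (leaf, partitioned table, leaf) whose partitioned child has two leaves, the root's array would be {0, -1, 1} and the child's {2, 3}; the root itself never appears as a value because it is nobody's partition.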
0003-Decouple-RelationGetPartitionDispatchInfo-from-execu.patch (text/plain; charset=UTF-8)
From 45301124931fda224ea3ab70146c5a6c4a72dbca Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Mon, 24 Jul 2017 18:59:57 +0900
Subject: [PATCH 3/3] Decouple RelationGetPartitionDispatchInfo() from executor
Currently it and the structure it generates viz. PartitionDispatch
objects are too coupled with the executor's tuple-routing code. In
particular, it's pretty undesirable that it makes it the responsibility
of the caller to release some resources, such as relcache references
and tuple table slots. That makes it harder to use in places other
than where it's currently being used.
After this refactoring, ExecSetupPartitionTupleRouting() now needs to
do some of the work that was previously done in
RelationGetPartitionDispatchInfo() and expand_inherited_rtentry() no
longer needs to do some things that it used to.
---
src/backend/catalog/partition.c | 309 +++++++++++++++++----------------
src/backend/commands/copy.c | 35 ++--
src/backend/executor/execMain.c | 145 ++++++++++++++--
src/backend/executor/nodeModifyTable.c | 29 ++--
src/backend/optimizer/prep/prepunion.c | 32 +---
src/include/catalog/partition.h | 52 +++---
src/include/executor/executor.h | 4 +-
src/include/nodes/execnodes.h | 53 +++++-
8 files changed, 398 insertions(+), 261 deletions(-)
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 96a64ce6b2..7618e4cb31 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -105,6 +105,24 @@ typedef struct PartitionRangeBound
bool lower; /* this is the lower (vs upper) bound */
} PartitionRangeBound;
+/*-----------------------
+ * PartitionDispatchData - information of partitions of one partitioned table
+ * in a partition tree
+ *
+ * partkey Partition key of the table
+ * partdesc Partition descriptor of the table
+ * indexes Array with partdesc->nparts members (for details on what the
+ * individual value represents, see the comments in
+ * RelationGetPartitionDispatchInfo())
+ *-----------------------
+ */
+typedef struct PartitionDispatchData
+{
+ PartitionKey partkey; /* Points into the table's relcache entry */
+ PartitionDesc partdesc; /* Ditto */
+ int *indexes;
+} PartitionDispatchData;
+
static int32 qsort_partition_list_value_cmp(const void *a, const void *b,
void *arg);
static int32 qsort_partition_rbound_cmp(const void *a, const void *b,
@@ -981,181 +999,165 @@ get_partition_qual_relid(Oid relid)
}
/*
- * Append OIDs of rel's partitions to the list 'partoids' and for each OID,
- * append pointer rel to the list 'parents'.
- */
-#define APPEND_REL_PARTITION_OIDS(rel, partoids, parents) \
- do\
- {\
- int i;\
- for (i = 0; i < (rel)->rd_partdesc->nparts; i++)\
- {\
- (partoids) = lappend_oid((partoids), (rel)->rd_partdesc->oids[i]);\
- (parents) = lappend((parents), (rel));\
- }\
- } while(0)
-
-/*
* RelationGetPartitionDispatchInfo
- * Returns information necessary to route tuples down a partition tree
+ * Returns necessary information for each partition in the partition
+ * tree rooted at rel
*
- * The number of elements in the returned array (that is, the number of
- * PartitionDispatch objects for the partitioned tables in the partition tree)
- * is returned in *num_parted and a list of the OIDs of all the leaf
- * partitions of rel is returned in *leaf_part_oids.
+ * Information returned includes the following: *ptinfos contains a list of
+ * PartitionedTableInfo objects, one for each partitioned table (with at least
+ * one member, that is, one for the root partitioned table), *leaf_part_oids
+ * contains a list of the OIDs of all the leaf partitions.
*
- * All the relations in the partition tree (including 'rel') must have been
- * locked (using at least the AccessShareLock) by the caller.
+ * We require that the caller has locked at least the partitioned tables in the
+ * partition tree (including 'rel') using at least the AccessShareLock,
+ * because we need to look at their relcache entries to get PartitionKey and
+ * PartitionDesc.
*/
-PartitionDispatch *
+void
RelationGetPartitionDispatchInfo(Relation rel,
- int *num_parted, List **leaf_part_oids)
+ List **ptinfos, List **leaf_part_oids)
{
- PartitionDispatchData **pd;
- List *all_parts = NIL,
- *all_parents = NIL,
- *parted_rels,
- *parted_rel_parents;
+ List *all_parts,
+ *all_parents;
ListCell *lc1,
*lc2;
int i,
- k,
offset;
/*
* We rely on the relcache to traverse the partition tree to build both
- * the leaf partition OIDs list and the array of PartitionDispatch objects
- * for the partitioned tables in the tree. That means every partitioned
- * table in the tree must be locked, which is fine since we require the
- * caller to lock all the partitions anyway.
+ * the leaf partition OIDs list and the list of PartitionedTableInfo
+ * objects for partitioned tables. That means every partitioned table in
+ * the tree must be locked, which is fine since the callers must have done
+ * that already.
*
* For every partitioned table in the tree, starting with the root
* partitioned table, add its relcache entry to parted_rels, while also
* queuing its partitions (in the order in which they appear in the
* partition descriptor) to be looked at later in the same loop. This is
* a bit tricky but works because the foreach() macro doesn't fetch the
- * next list element until the bottom of the loop.
+ * next list element until the bottom of the loop. Non-partitioned tables
+ * are simply added to the leaf partitions list.
*/
- *num_parted = 1;
- parted_rels = list_make1(rel);
- /* Root partitioned table has no parent, so NULL for parent */
- parted_rel_parents = list_make1(NULL);
- APPEND_REL_PARTITION_OIDS(rel, all_parts, all_parents);
+ i = offset = 0;
+ *ptinfos = *leaf_part_oids = NIL;
+
+ /* Start with the root table. */
+ all_parts = list_make1_oid(RelationGetRelid(rel));
+ all_parents = list_make1_oid(InvalidOid);
forboth(lc1, all_parts, lc2, all_parents)
{
- Oid partrelid = lfirst_oid(lc1);
- Relation parent = lfirst(lc2);
+ Oid partrelid = lfirst_oid(lc1);
+ Oid parentrelid = lfirst_oid(lc2);
if (get_rel_relkind(partrelid) == RELKIND_PARTITIONED_TABLE)
{
- /*
- * Already locked by the caller. Note that it is the
- * responsibility of the caller to close the below relcache entry,
- * once done using the information being collected here (for
- * example, in ExecEndModifyTable).
- */
- Relation partrel = heap_open(partrelid, NoLock);
+ int j,
+ k;
+ Relation partrel;
+ PartitionKey partkey;
+ PartitionDesc partdesc;
+ PartitionedTableInfo *ptinfo;
+ PartitionDispatch pd;
+
+ if (partrelid != RelationGetRelid(rel))
+ partrel = heap_open(partrelid, NoLock);
+ else
+ partrel = rel;
- (*num_parted)++;
- parted_rels = lappend(parted_rels, partrel);
- parted_rel_parents = lappend(parted_rel_parents, parent);
- APPEND_REL_PARTITION_OIDS(partrel, all_parts, all_parents);
- }
- }
+ partkey = RelationGetPartitionKey(partrel);
+ partdesc = RelationGetPartitionDesc(partrel);
+
+ ptinfo = (PartitionedTableInfo *)
+ palloc0(sizeof(PartitionedTableInfo));
+ ptinfo->relid = partrelid;
+ ptinfo->parentid = parentrelid;
+
+ ptinfo->pd = pd = (PartitionDispatchData *)
+ palloc0(sizeof(PartitionDispatchData));
+ pd->partkey = partkey;
- /*
- * We want to create two arrays - one for leaf partitions and another for
- * partitioned tables (including the root table and internal partitions).
- * While we only create the latter here, leaf partition array of suitable
- * objects (such as, ResultRelInfo) is created by the caller using the
- * list of OIDs we return. Indexes into these arrays get assigned in a
- * breadth-first manner, whereby partitions of any given level are placed
- * consecutively in the respective arrays.
- */
- pd = (PartitionDispatchData **) palloc(*num_parted *
- sizeof(PartitionDispatchData *));
- *leaf_part_oids = NIL;
- i = k = offset = 0;
- forboth(lc1, parted_rels, lc2, parted_rel_parents)
- {
- Relation partrel = lfirst(lc1);
- Relation parent = lfirst(lc2);
- PartitionKey partkey = RelationGetPartitionKey(partrel);
- TupleDesc tupdesc = RelationGetDescr(partrel);
- PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
- int j,
- m;
-
- pd[i] = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
- pd[i]->reldesc = partrel;
- pd[i]->key = partkey;
- pd[i]->keystate = NIL;
- pd[i]->partdesc = partdesc;
- if (parent != NULL)
- {
/*
- * For every partitioned table other than root, we must store a
- * tuple table slot initialized with its tuple descriptor and a
- * tuple conversion map to convert a tuple from its parent's
- * rowtype to its own. That is to make sure that we are looking at
- * the correct row using the correct tuple descriptor when
- * computing its partition key for tuple routing.
+ * XXX- do we need a pinning mechanism for partition descriptors
+ * so that their references can be managed independently of
+ * the parent relcache entry? Like PinPartitionDesc(partdesc)?
*/
- pd[i]->tupslot = MakeSingleTupleTableSlot(tupdesc);
- pd[i]->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
- tupdesc,
- gettext_noop("could not convert row type"));
- }
- else
- {
- /* Not required for the root partitioned table */
- pd[i]->tupslot = NULL;
- pd[i]->tupmap = NULL;
- }
- pd[i]->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
+ pd->partdesc = partdesc;
- /*
- * Indexes corresponding to the internal partitions are multiplied by
- * -1 to distinguish them from those of leaf partitions. Encountering
- * an index >= 0 means we found a leaf partition, which is immediately
- * returned as the partition we are looking for. A negative index
- * means we found a partitioned table, whose PartitionDispatch object
- * is located at the above index multiplied back by -1. Using the
- * PartitionDispatch object, search is continued further down the
- * partition tree.
- */
- m = 0;
- for (j = 0; j < partdesc->nparts; j++)
- {
- Oid partrelid = partdesc->oids[j];
+ /*
+ * The values contained in the following array correspond to
+ * indexes of this table's partitions in the global sequence of
+ * all the partitions contained in the partition tree rooted at
+ * rel, traversed in a breadth-first manner. The values should be
+ * such that we will be able to distinguish the leaf partitions
+ * from the non-leaf partitions, because they are returned to
+ * the caller in separate structures from where they will be
+ * accessed. The way that's done is described below:
+ *
+ * Leaf partition OIDs are put into the global leaf_part_oids list,
+ * and for each one, the value stored is its ordinal position in
+ * the list minus 1.
+ *
+ * PartitionedTableInfo objects corresponding to partitions that
+ * are partitioned tables are put into the global ptinfos list,
+ * and for each one, the value stored is its zero-based position
+ * in the list multiplied by -1.
+ *
+ * So while looking at the values in the indexes array, if one
+ * gets zero or a positive value, then it's a leaf partition;
+ * otherwise, it's a partitioned table.
+ */
+ pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
- if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
- {
- *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
- pd[i]->indexes[j] = k++;
- }
- else
+ k = 0;
+ for (j = 0; j < partdesc->nparts; j++)
{
+ Oid partrelid = partdesc->oids[j];
+
/*
- * offset denotes the number of partitioned tables of upper
- * levels including those of the current level. Any partition
- * of this table must belong to the next level and hence will
- * be placed after the last partitioned table of this level.
+ * Queue this partition so that it will be processed later
+ * by the outer loop.
*/
- pd[i]->indexes[j] = -(1 + offset + m);
- m++;
+ all_parts = lappend_oid(all_parts, partrelid);
+ all_parents = lappend_oid(all_parents,
+ RelationGetRelid(partrel));
+
+ if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
+ {
+ *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
+ pd->indexes[j] = i++;
+ }
+ else
+ {
+ /*
+ * offset denotes the number of partitioned tables that
+ * we have already processed. k counts the number of
+ * partitions of this table that were found to be
+ * partitioned tables.
+ */
+ pd->indexes[j] = -(1 + offset + k);
+ k++;
+ }
}
- }
- i++;
- /*
- * This counts the number of partitioned tables at upper levels
- * including those of the current level.
- */
- offset += m;
+ offset += k;
+
+ /*
+ * Release the relation descriptor. The lock that we hold on the
+ * table will keep the PartitionDesc that is pointing into
+ * RelationData intact, a pointer to which we hope to keep
+ * through this transaction's commit.
+ * (XXX - how true is that?)
+ */
+ if (partrel != rel)
+ heap_close(partrel, NoLock);
+
+ *ptinfos = lappend(*ptinfos, ptinfo);
+ }
}
- return pd;
+ Assert(i == list_length(*leaf_part_oids));
+ Assert((offset + 1) == list_length(*ptinfos));
}
/* Module-local functions */
@@ -1872,7 +1874,7 @@ generate_partition_qual(Relation rel)
* ----------------
*/
void
-FormPartitionKeyDatum(PartitionDispatch pd,
+FormPartitionKeyDatum(PartitionKeyInfo *keyinfo,
TupleTableSlot *slot,
EState *estate,
Datum *values,
@@ -1881,20 +1883,21 @@ FormPartitionKeyDatum(PartitionDispatch pd,
ListCell *partexpr_item;
int i;
- if (pd->key->partexprs != NIL && pd->keystate == NIL)
+ if (keyinfo->key->partexprs != NIL && keyinfo->keystate == NIL)
{
/* Check caller has set up context correctly */
Assert(estate != NULL &&
GetPerTupleExprContext(estate)->ecxt_scantuple == slot);
/* First time through, set up expression evaluation state */
- pd->keystate = ExecPrepareExprList(pd->key->partexprs, estate);
+ keyinfo->keystate = ExecPrepareExprList(keyinfo->key->partexprs,
+ estate);
}
- partexpr_item = list_head(pd->keystate);
- for (i = 0; i < pd->key->partnatts; i++)
+ partexpr_item = list_head(keyinfo->keystate);
+ for (i = 0; i < keyinfo->key->partnatts; i++)
{
- AttrNumber keycol = pd->key->partattrs[i];
+ AttrNumber keycol = keyinfo->key->partattrs[i];
Datum datum;
bool isNull;
@@ -1931,13 +1934,13 @@ FormPartitionKeyDatum(PartitionDispatch pd,
* the latter case.
*/
int
-get_partition_for_tuple(PartitionDispatch *pd,
+get_partition_for_tuple(PartitionTupleRoutingInfo **ptrinfos,
TupleTableSlot *slot,
EState *estate,
- PartitionDispatchData **failed_at,
+ PartitionTupleRoutingInfo **failed_at,
TupleTableSlot **failed_slot)
{
- PartitionDispatch parent;
+ PartitionTupleRoutingInfo *parent;
Datum values[PARTITION_MAX_KEYS];
bool isnull[PARTITION_MAX_KEYS];
int cur_offset,
@@ -1948,11 +1951,11 @@ get_partition_for_tuple(PartitionDispatch *pd,
TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
/* start with the root partitioned table */
- parent = pd[0];
+ parent = ptrinfos[0];
while (true)
{
- PartitionKey key = parent->key;
- PartitionDesc partdesc = parent->partdesc;
+ PartitionKey key = parent->pd->partkey;
+ PartitionDesc partdesc = parent->pd->partdesc;
TupleTableSlot *myslot = parent->tupslot;
TupleConversionMap *map = parent->tupmap;
@@ -1984,7 +1987,7 @@ get_partition_for_tuple(PartitionDispatch *pd,
* So update ecxt_scantuple accordingly.
*/
ecxt->ecxt_scantuple = slot;
- FormPartitionKeyDatum(parent, slot, estate, values, isnull);
+ FormPartitionKeyDatum(parent->keyinfo, slot, estate, values, isnull);
if (key->strategy == PARTITION_STRATEGY_RANGE)
{
@@ -2055,13 +2058,13 @@ get_partition_for_tuple(PartitionDispatch *pd,
*failed_slot = slot;
break;
}
- else if (parent->indexes[cur_index] >= 0)
+ else if (parent->pd->indexes[cur_index] >= 0)
{
- result = parent->indexes[cur_index];
+ result = parent->pd->indexes[cur_index];
break;
}
else
- parent = pd[-parent->indexes[cur_index]];
+ parent = ptrinfos[-parent->pd->indexes[cur_index]];
}
error_exit:
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index a258965c20..e17a339349 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -165,8 +165,8 @@ typedef struct CopyStateData
bool volatile_defexprs; /* is any of defexprs volatile? */
List *range_table;
- PartitionDispatch *partition_dispatch_info;
- int num_dispatch; /* Number of entries in the above array */
+ PartitionTupleRoutingInfo **ptrinfos;
+ int num_parted; /* Number of entries in the above array */
int num_partitions; /* Number of members in the following arrays */
ResultRelInfo *partitions; /* Per partition result relation */
TupleConversionMap **partition_tupconv_maps;
@@ -1425,7 +1425,7 @@ BeginCopy(ParseState *pstate,
/* Initialize state for CopyFrom tuple routing. */
if (is_from && rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- PartitionDispatch *partition_dispatch_info;
+ PartitionTupleRoutingInfo **ptrinfos;
ResultRelInfo *partitions;
TupleConversionMap **partition_tupconv_maps;
TupleTableSlot *partition_tuple_slot;
@@ -1434,13 +1434,13 @@ BeginCopy(ParseState *pstate,
ExecSetupPartitionTupleRouting(rel,
1,
- &partition_dispatch_info,
+ &ptrinfos,
&partitions,
&partition_tupconv_maps,
&partition_tuple_slot,
&num_parted, &num_partitions);
- cstate->partition_dispatch_info = partition_dispatch_info;
- cstate->num_dispatch = num_parted;
+ cstate->ptrinfos = ptrinfos;
+ cstate->num_parted = num_parted;
cstate->partitions = partitions;
cstate->num_partitions = num_partitions;
cstate->partition_tupconv_maps = partition_tupconv_maps;
@@ -2495,7 +2495,7 @@ CopyFrom(CopyState cstate)
if ((resultRelInfo->ri_TrigDesc != NULL &&
(resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
- cstate->partition_dispatch_info != NULL ||
+ cstate->ptrinfos != NULL ||
cstate->volatile_defexprs)
{
useHeapMultiInsert = false;
@@ -2573,7 +2573,7 @@ CopyFrom(CopyState cstate)
ExecStoreTuple(tuple, slot, InvalidBuffer, false);
/* Determine the partition to heap_insert the tuple into */
- if (cstate->partition_dispatch_info)
+ if (cstate->ptrinfos)
{
int leaf_part_index;
TupleConversionMap *map;
@@ -2587,7 +2587,7 @@ CopyFrom(CopyState cstate)
* partition, respectively.
*/
leaf_part_index = ExecFindPartition(resultRelInfo,
- cstate->partition_dispatch_info,
+ cstate->ptrinfos,
slot,
estate);
Assert(leaf_part_index >= 0 &&
@@ -2819,23 +2819,20 @@ CopyFrom(CopyState cstate)
ExecCloseIndices(resultRelInfo);
- /* Close all the partitioned tables, leaf partitions, and their indices */
- if (cstate->partition_dispatch_info)
+ /* Close all the leaf partitions and their indices */
+ if (cstate->ptrinfos)
{
int i;
/*
- * Remember cstate->partition_dispatch_info[0] corresponds to the root
- * partitioned table, which we must not try to close, because it is
- * the main target table of COPY that will be closed eventually by
- * DoCopy(). Also, tupslot is NULL for the root partitioned table.
+ * cstate->ptrinfos[0] corresponds to the root partitioned table, for
+ * which we didn't create tupslot.
*/
- for (i = 1; i < cstate->num_dispatch; i++)
+ for (i = 1; i < cstate->num_parted; i++)
{
- PartitionDispatch pd = cstate->partition_dispatch_info[i];
+ PartitionTupleRoutingInfo *ptrinfo = cstate->ptrinfos[i];
- heap_close(pd->reldesc, NoLock);
- ExecDropSingleTupleTableSlot(pd->tupslot);
+ ExecDropSingleTupleTableSlot(ptrinfo->tupslot);
}
for (i = 0; i < cstate->num_partitions; i++)
{
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 74071eede6..15366fa4cd 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -3215,8 +3215,8 @@ EvalPlanQualEnd(EPQState *epqstate)
* tuple routing for partitioned tables
*
* Output arguments:
- * 'pd' receives an array of PartitionDispatch objects with one entry for
- * every partitioned table in the partition tree
+ * 'ptrinfos' receives an array of PartitionTupleRoutingInfo objects with one
+ * entry for each partitioned table in the partition tree
* 'partitions' receives an array of ResultRelInfo objects with one entry for
* every leaf partition in the partition tree
* 'tup_conv_maps' receives an array of TupleConversionMap objects with one
@@ -3238,7 +3238,7 @@ EvalPlanQualEnd(EPQState *epqstate)
void
ExecSetupPartitionTupleRouting(Relation rel,
Index resultRTindex,
- PartitionDispatch **pd,
+ PartitionTupleRoutingInfo ***ptrinfos,
ResultRelInfo **partitions,
TupleConversionMap ***tup_conv_maps,
TupleTableSlot **partition_tuple_slot,
@@ -3246,16 +3246,135 @@ ExecSetupPartitionTupleRouting(Relation rel,
{
TupleDesc tupDesc = RelationGetDescr(rel);
List *leaf_parts;
+ List *ptinfos = NIL;
ListCell *cell;
int i;
ResultRelInfo *leaf_part_rri;
+ Relation parent;
/*
* Get the information about the partition tree after locking all the
* partitions.
*/
(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
- *pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
+ RelationGetPartitionDispatchInfo(rel, &ptinfos, &leaf_parts);
+
+ /*
+ * The ptinfos list contains PartitionedTableInfo objects for all the
+ * partitioned tables in the partition tree. Using the information
+ * therein, we construct an array of PartitionTupleRoutingInfo objects
+ * to be used during tuple-routing.
+ */
+ *num_parted = list_length(ptinfos);
+ *ptrinfos = (PartitionTupleRoutingInfo **) palloc0(*num_parted *
+ sizeof(PartitionTupleRoutingInfo *));
+ /*
+ * Free the ptinfos List structure itself as we go through (open-coded
+ * list_free).
+ */
+ i = 0;
+ cell = list_head(ptinfos);
+ parent = NULL;
+ while (cell)
+ {
+ ListCell *tmp = cell;
+ PartitionedTableInfo *ptinfo = lfirst(tmp),
+ *next_ptinfo = NULL;
+ Relation partrel;
+ PartitionTupleRoutingInfo *ptrinfo;
+
+ if (lnext(tmp))
+ next_ptinfo = lfirst(lnext(tmp));
+
+ /* As mentioned above, the partitioned tables have been locked. */
+ if (ptinfo->relid != RelationGetRelid(rel))
+ partrel = heap_open(ptinfo->relid, NoLock);
+ else
+ partrel = rel;
+
+ ptrinfo = (PartitionTupleRoutingInfo *)
+ palloc0(sizeof(PartitionTupleRoutingInfo));
+ ptrinfo->relid = ptinfo->relid;
+
+ /* Stash a reference to this PartitionDispatch. */
+ ptrinfo->pd = ptinfo->pd;
+
+ /* State for extracting partition key from tuples will go here. */
+ ptrinfo->keyinfo = (PartitionKeyInfo *)
+ palloc0(sizeof(PartitionKeyInfo));
+ ptrinfo->keyinfo->key = RelationGetPartitionKey(partrel);
+ ptrinfo->keyinfo->keystate = NIL;
+
+ /*
+ * For every partitioned table other than root, we must store a tuple
+ * table slot initialized with its tuple descriptor and a tuple
+ * conversion map to convert a tuple from its parent's rowtype to its
+ * own. That is to make sure that we are looking at the correct row
+ * using the correct tuple descriptor when computing its partition key
+ * for tuple routing.
+ */
+ if (ptinfo->parentid != InvalidOid)
+ {
+ TupleDesc tupdesc = RelationGetDescr(partrel);
+
+ /* Open the parent relation descriptor if not already done. */
+ if (ptinfo->parentid == RelationGetRelid(rel))
+ {
+ parent = rel;
+ }
+ else if (parent == NULL)
+ {
+ /* Already locked by find_all_inheritors() above. */
+ parent = heap_open(ptinfo->parentid, NoLock);
+ }
+
+ ptrinfo->tupslot = MakeSingleTupleTableSlot(tupdesc);
+ ptrinfo->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
+ tupdesc,
+ gettext_noop("could not convert row type"));
+
+ /*
+ * Close the parent descriptor, if the next partitioned table in
+ * the list is not a sibling, because it will have a different
+ * parent if so.
+ */
+ if (parent != NULL && parent != rel &&
+ next_ptinfo != NULL &&
+ next_ptinfo->parentid != ptinfo->parentid)
+ {
+ heap_close(parent, NoLock);
+ parent = NULL;
+ }
+
+ /*
+ * Release the relation descriptor. Lock that we have on the
+ * table will keep the PartitionDesc that is pointing into
+ * RelationData intact, a pointer to which we hope to keep
+ * through this transaction's commit.
+ * (XXX - how true is that?)
+ */
+ if (partrel != rel)
+ heap_close(partrel, NoLock);
+ }
+ else
+ {
+ /* Not required for the root partitioned table */
+ ptrinfo->tupslot = NULL;
+ ptrinfo->tupmap = NULL;
+ }
+
+ (*ptrinfos)[i++] = ptrinfo;
+
+ /* Free the ListCell. */
+ cell = lnext(cell);
+ pfree(tmp);
+ }
+
+ /* Free the List itself. */
+ if (ptinfos)
+ pfree(ptinfos);
+
+ /* For leaf partitions, we build ResultRelInfos and TupleConversionMaps. */
*num_partitions = list_length(leaf_parts);
*partitions = (ResultRelInfo *) palloc(*num_partitions *
sizeof(ResultRelInfo));
@@ -3282,7 +3401,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
* Note that each of the relations in *partitions are eventually
* closed by the caller.
*/
- partrel = heap_open(lfirst_oid(cell), NoLock);
+ partrel = heap_open(lfirst_oid(cell), RowExclusiveLock);
part_tupdesc = RelationGetDescr(partrel);
/*
@@ -3295,7 +3414,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
* partition from the parent's type to the partition's.
*/
(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
- gettext_noop("could not convert row type"));
+ gettext_noop("could not convert row type"));
InitResultRelInfo(leaf_part_rri,
partrel,
@@ -3329,11 +3448,13 @@ ExecSetupPartitionTupleRouting(Relation rel,
* by get_partition_for_tuple() unchanged.
*/
int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
- TupleTableSlot *slot, EState *estate)
+ExecFindPartition(ResultRelInfo *resultRelInfo,
+ PartitionTupleRoutingInfo **ptrinfos,
+ TupleTableSlot *slot,
+ EState *estate)
{
int result;
- PartitionDispatchData *failed_at;
+ PartitionTupleRoutingInfo *failed_at;
TupleTableSlot *failed_slot;
/*
@@ -3343,7 +3464,7 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
if (resultRelInfo->ri_PartitionCheck)
ExecPartitionCheck(resultRelInfo, slot, estate);
- result = get_partition_for_tuple(pd, slot, estate,
+ result = get_partition_for_tuple(ptrinfos, slot, estate,
&failed_at, &failed_slot);
if (result < 0)
{
@@ -3353,9 +3474,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
char *val_desc;
ExprContext *ecxt = GetPerTupleExprContext(estate);
- failed_rel = failed_at->reldesc;
+ failed_rel = heap_open(failed_at->relid, NoLock);
ecxt->ecxt_scantuple = failed_slot;
- FormPartitionKeyDatum(failed_at, failed_slot, estate,
+ FormPartitionKeyDatum(failed_at->keyinfo, failed_slot, estate,
key_values, key_isnull);
val_desc = ExecBuildSlotPartitionKeyDescription(failed_rel,
key_values,
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 36b2b43bc6..9cf974c938 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -277,7 +277,7 @@ ExecInsert(ModifyTableState *mtstate,
resultRelInfo = estate->es_result_relation_info;
/* Determine the partition to heap_insert the tuple into */
- if (mtstate->mt_partition_dispatch_info)
+ if (mtstate->mt_ptrinfos)
{
int leaf_part_index;
TupleConversionMap *map;
@@ -291,7 +291,7 @@ ExecInsert(ModifyTableState *mtstate,
* respectively.
*/
leaf_part_index = ExecFindPartition(resultRelInfo,
- mtstate->mt_partition_dispatch_info,
+ mtstate->mt_ptrinfos,
slot,
estate);
Assert(leaf_part_index >= 0 &&
@@ -1486,7 +1486,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
int numResultRelInfos;
/* Find the set of partitions so that we can find their TupleDescs. */
- if (mtstate->mt_partition_dispatch_info != NULL)
+ if (mtstate->mt_ptrinfos != NULL)
{
/*
* For INSERT via partitioned table, so we need TupleDescs based
@@ -1910,7 +1910,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
if (operation == CMD_INSERT &&
rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- PartitionDispatch *partition_dispatch_info;
+ PartitionTupleRoutingInfo **ptrinfos;
ResultRelInfo *partitions;
TupleConversionMap **partition_tupconv_maps;
TupleTableSlot *partition_tuple_slot;
@@ -1919,13 +1919,13 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
ExecSetupPartitionTupleRouting(rel,
node->nominalRelation,
- &partition_dispatch_info,
+ &ptrinfos,
&partitions,
&partition_tupconv_maps,
&partition_tuple_slot,
&num_parted, &num_partitions);
- mtstate->mt_partition_dispatch_info = partition_dispatch_info;
- mtstate->mt_num_dispatch = num_parted;
+ mtstate->mt_ptrinfos = ptrinfos;
+ mtstate->mt_num_parted = num_parted;
mtstate->mt_partitions = partitions;
mtstate->mt_num_partitions = num_partitions;
mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
@@ -2335,19 +2335,16 @@ ExecEndModifyTable(ModifyTableState *node)
}
/*
- * Close all the partitioned tables, leaf partitions, and their indices
+ * Close all the leaf partitions and their indices.
*
- * Remember node->mt_partition_dispatch_info[0] corresponds to the root
- * partitioned table, which we must not try to close, because it is the
- * main target table of the query that will be closed by ExecEndPlan().
- * Also, tupslot is NULL for the root partitioned table.
+ * node->mt_partition_dispatch_info[0] corresponds to the root partitioned
+ * table, for which we didn't create tupslot.
*/
- for (i = 1; i < node->mt_num_dispatch; i++)
+ for (i = 1; i < node->mt_num_parted; i++)
{
- PartitionDispatch pd = node->mt_partition_dispatch_info[i];
+ PartitionTupleRoutingInfo *ptrinfo = node->mt_ptrinfos[i];
- heap_close(pd->reldesc, NoLock);
- ExecDropSingleTupleTableSlot(pd->tupslot);
+ ExecDropSingleTupleTableSlot(ptrinfo->tupslot);
}
for (i = 0; i < node->mt_num_partitions; i++)
{
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index e730c24ee4..6abfbec236 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1459,48 +1459,30 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
*/
if (rte->relkind == RELKIND_PARTITIONED_TABLE)
{
- List *leaf_part_oids;
- int num_parted,
- i;
- PartitionDispatch *pds;
+ List *leaf_part_oids,
+ *ptinfos;
/* Discard the original list. */
list_free(inhOIDs);
inhOIDs = NIL;
/* Request partitioning information. */
- pds = RelationGetPartitionDispatchInfo(oldrelation, &num_parted,
- &leaf_part_oids);
+ RelationGetPartitionDispatchInfo(oldrelation, &ptinfos,
+ &leaf_part_oids);
/*
* First collect the partitioned child table OIDs, which includes the
* root parent at the head.
*/
- for (i = 0; i < num_parted; i++)
+ foreach(l, ptinfos)
{
- PartitionDispatch pd = pds[i];
+ PartitionedTableInfo *ptinfo = lfirst(l);
- inhOIDs = lappend_oid(inhOIDs, RelationGetRelid(pd->reldesc));
+ inhOIDs = lappend_oid(inhOIDs, ptinfo->relid);
}
/* Concatenate the leaf partition OIDs. */
inhOIDs = list_concat(inhOIDs, leaf_part_oids);
-
- /*
- * Release the resources that RelationGetPartitionDispatchInfo
- * acquired for us but we don't really need in this case. Note that
- * we don't touch the root partitioned table itself by starting the
- * loop with 1, not 0.
- */
- for (i = 1; i < num_parted; i++)
- {
- PartitionDispatch pd = pds[i];
-
- heap_close(pd->reldesc, NoLock);
- ExecDropSingleTupleTableSlot(pd->tupslot);
- if (pd->tupmap)
- pfree(pd->tupmap);
- }
}
/* Scan the inheritance set and expand it */
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 2283c675e9..7b53baf847 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -39,36 +39,23 @@ typedef struct PartitionDescData
typedef struct PartitionDescData *PartitionDesc;
-/*-----------------------
- * PartitionDispatch - information about one partitioned table in a partition
- * hierarchy required to route a tuple to one of its partitions
- *
- * reldesc Relation descriptor of the table
- * key Partition key information of the table
- * keystate Execution state required for expressions in the partition key
- * partdesc Partition descriptor of the table
- * tupslot A standalone TupleTableSlot initialized with this table's tuple
- * descriptor
- * tupmap TupleConversionMap to convert from the parent's rowtype to
- * this table's rowtype (when extracting the partition key of a
- * tuple just before routing it through this table)
- * indexes Array with partdesc->nparts members (for details on what
- * individual members represent, see how they are set in
- * RelationGetPartitionDispatchInfo())
- *-----------------------
+typedef struct PartitionDispatchData *PartitionDispatch;
+
+/*
+ * Information about one partitioned table in a given partition tree
*/
-typedef struct PartitionDispatchData
+typedef struct PartitionedTableInfo
{
- Relation reldesc;
- PartitionKey key;
- List *keystate; /* list of ExprState */
- PartitionDesc partdesc;
- TupleTableSlot *tupslot;
- TupleConversionMap *tupmap;
- int *indexes;
-} PartitionDispatchData;
+ Oid relid;
+ Oid parentid;
-typedef struct PartitionDispatchData *PartitionDispatch;
+ /*
+ * This contains information about bounds of the partitions of this
+ * table and about where individual partitions are placed in the global
+ * partition tree.
+ */
+ PartitionDispatch pd;
+} PartitionedTableInfo;
extern void RelationBuildPartitionDesc(Relation relation);
extern bool partition_bounds_equal(int partnatts, int16 *parttyplen,
@@ -86,17 +73,18 @@ extern List *map_partition_varattnos(List *expr, int target_varno,
extern List *RelationGetPartitionQual(Relation rel);
extern Expr *get_partition_qual_relid(Oid relid);
+extern void RelationGetPartitionDispatchInfo(Relation rel,
+ List **ptinfos, List **leaf_part_oids);
+
/* For tuple routing */
-extern PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
- int *num_parted, List **leaf_part_oids);
-extern void FormPartitionKeyDatum(PartitionDispatch pd,
+extern void FormPartitionKeyDatum(PartitionKeyInfo *keyinfo,
TupleTableSlot *slot,
EState *estate,
Datum *values,
bool *isnull);
-extern int get_partition_for_tuple(PartitionDispatch *pd,
+extern int get_partition_for_tuple(PartitionTupleRoutingInfo **ptrinfos,
TupleTableSlot *slot,
EState *estate,
- PartitionDispatchData **failed_at,
+ PartitionTupleRoutingInfo **failed_at,
TupleTableSlot **failed_slot);
#endif /* PARTITION_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 60326f9d03..6e1d3a6d2f 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -208,13 +208,13 @@ extern void EvalPlanQualSetTuple(EPQState *epqstate, Index rti,
extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti);
extern void ExecSetupPartitionTupleRouting(Relation rel,
Index resultRTindex,
- PartitionDispatch **pd,
+ PartitionTupleRoutingInfo ***ptrinfos,
ResultRelInfo **partitions,
TupleConversionMap ***tup_conv_maps,
TupleTableSlot **partition_tuple_slot,
int *num_parted, int *num_partitions);
extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
- PartitionDispatch *pd,
+ PartitionTupleRoutingInfo **ptrinfos,
TupleTableSlot *slot,
EState *estate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 577499465d..07e50e0914 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -414,6 +414,55 @@ typedef struct ResultRelInfo
Relation ri_PartitionRoot;
} ResultRelInfo;
+/* Forward declarations, to avoid including other headers */
+typedef struct PartitionKeyData *PartitionKey;
+typedef struct PartitionDispatchData *PartitionDispatch;
+
+/*
+ * PartitionKeyInfoData - execution state for the partition key of a
+ * partitioned table
+ *
+ * keystate is the execution state required for expressions contained in the
+ * partition key. It is NIL until initialized by FormPartitionKeyDatum() if
+ * and when it is called; for example, during tuple routing through a given
+ * partitioned table.
+ */
+typedef struct PartitionKeyInfo
+{
+ PartitionKey key; /* Points into the table's relcache entry */
+ List *keystate;
+} PartitionKeyInfo;
+
+/*
+ * PartitionTupleRoutingInfo - information required for tuple-routing
+ * through one partitioned table in a partition
+ * tree
+ */
+typedef struct PartitionTupleRoutingInfo
+{
+ /* OID of the table */
+ Oid relid;
+
+ /* Information about the table's partitions */
+ PartitionDispatch pd;
+
+ /* See comment above the definition of PartitionKeyInfo */
+ PartitionKeyInfo *keyinfo;
+
+ /*
+ * A standalone TupleTableSlot initialized with this table's tuple
+ * descriptor
+ */
+ TupleTableSlot *tupslot;
+
+ /*
+ * TupleConversionMap to convert from the parent's rowtype to this table's
+ * rowtype (when extracting the partition key of a tuple just before
+ * routing it through this table)
+ */
+ TupleConversionMap *tupmap;
+} PartitionTupleRoutingInfo;
+
/* ----------------
* EState information
*
@@ -970,9 +1019,9 @@ typedef struct ModifyTableState
TupleTableSlot *mt_existing; /* slot to store existing target tuple in */
List *mt_excludedtlist; /* the excluded pseudo relation's tlist */
TupleTableSlot *mt_conflproj; /* CONFLICT ... SET ... projection target */
- struct PartitionDispatchData **mt_partition_dispatch_info;
/* Tuple-routing support info */
- int mt_num_dispatch; /* Number of entries in the above array */
+ struct PartitionTupleRoutingInfo **mt_ptrinfos;
+ int mt_num_parted; /* Number of entries in the above array */
int mt_num_partitions; /* Number of members in the following
* arrays */
ResultRelInfo *mt_partitions; /* Per partition result relation */
--
2.11.0
On 17 August 2017 at 06:39, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Hi Amit,
Thanks for the comments.
On 2017/08/16 20:30, Amit Khandekar wrote:
On 16 August 2017 at 11:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Attached updated patches.
Thanks Amit for the patches.
I too agree with the overall approach taken for keeping the locking
order consistent: it's best to do the locking with the existing
find_all_inheritors() since it is much cheaper than to lock them in
partition-bound order, the latter being expensive since it requires
opening partitioned tables.
Yeah. Per Robert's design idea, we will always do the *locking* in
the order determined by find_all_inheritors/find_inheritance_children.
I haven't yet done anything about changing the timing of opening and
locking leaf partitions, because it will require some more thinking about
the required planner changes. But the above set of patches will get us
far enough to get leaf partition sub-plans to appear in the partition bound
order (same order as what partition tuple-routing uses in the executor).
So, I believe none of the changes done in pg_inherits.c are essential
for expanding the inheritance tree in bound order, right?
Right.
The changes to pg_inherits.c are only about recognizing partitioned tables
in an inheritance hierarchy and putting them ahead in the returned list.
Now that I think of it, the title of the patch (teach pg_inherits.c about
"partitioning") sounds a bit confusing. In particular, the patch does not
teach it things like partition bound order, just that some tables in the
hierarchy could be partitioned tables.
I am not sure whether we are planning to commit these two things together or
incrementally:
1. Expand in bound order
2. Allow for locking only the partitioned tables first.
For #1, the changes in pg_inherits.c are not required (viz, keeping
partitioned tables at the head of the list, adding inhchildparted
column, etc).
Yes. Changes to pg_inherits.c can be committed completely independently
of either 1 or 2, although then there would be nobody using that capability.
About 2: I think the capability to lock only the partitioned tables in
expand_inherited_rtentry() will only be used once we have the patch to do
the necessary planner restructuring that will allow us to defer child
table locking to some place that is not expand_inherited_rtentry(). I am
working on that patch now and should be able to show something soon.
If we are going to do #2 together with #1, then in the patch set there
is no one using the capability introduced by #2. That is, there are no
callers of find_all_inheritors() that make use of the new
num_partitioned_children parameter. Also, there is no boolean
parameter for find_all_inheritors() to be used to lock only the
partitioned tables.
I feel we should think about
0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch later, and
first get the review done for the other patches.
I think that makes sense.
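To illustrate the point made earlier in this message about keeping the locking order separate from the expansion order, here is a minimal sketch. It is not code from any of the patches: ToyOid, cmp_oid() and lock_order_from_bound_order() are hypothetical names, and the input array is simply assumed to already be in partition bound order. The idea is only that locks are taken in ascending OID order (as find_all_inheritors() would give), while the bound-ordered list is left untouched for later use.

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned int ToyOid;    /* hypothetical stand-in for Oid */

/* qsort comparator: ascending OID order. */
static int
cmp_oid(const void *a, const void *b)
{
    ToyOid      x = *(const ToyOid *) a;
    ToyOid      y = *(const ToyOid *) b;

    return (x > y) - (x < y);
}

/*
 * Copy bound_order[] (assumed to be in partition bound order) into
 * lock_order[] and sort the copy by OID.  A caller would take locks by
 * walking lock_order[] and then build its plan from bound_order[], which
 * is not modified here.
 */
void
lock_order_from_bound_order(const ToyOid *bound_order, int n,
                            ToyOid *lock_order)
{
    int         i;

    assert(n >= 0);
    for (i = 0; i < n; i++)
        lock_order[i] = bound_order[i];
    qsort(lock_order, (size_t) n, sizeof(ToyOid), cmp_oid);
}
```

Any two backends that derive lock_order[] this way acquire locks on the same set of children in the same sequence, whatever the bound order happens to be.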
I see that RelationGetPartitionDispatchInfo() now returns quite a
small subset of what it used to return, which is good. But I feel for
expand_inherited_rtentry(), RelationGetPartitionDispatchInfo() is
still a bit heavy. We only require the oids, so the
PartitionedTableInfo data structure (including the pd->indexes array)
gets discarded.
Maybe we could make the output argument optional, but I don't see much
point in being too conservative here. Building the indexes array does not
cost us that much and if a not-too-distant-in-future patch could use that
information somehow, it could do so for free.
OK, so these changes were made mostly with some future
use-cases in mind. Otherwise, I was thinking we could just keep a light-weight
function to generate the oids, and keep the current
RelationGetPartitionDispatchInfo() intact.
Anyways, some more comments :
In ExecSetupPartitionTupleRouting(), I am not sure why the ptrinfos array is an
array of pointers. Why can't it be an array of
PartitionTupleRoutingInfo structures rather than of pointers to that
structure?
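For reference, the two layouts in question could look roughly like the following. This is only a sketch: RoutingInfo is a hypothetical one-field stand-in for PartitionTupleRoutingInfo, and plain calloc() is used where the backend would use palloc0(). The first function mirrors what the patch does (an array of pointers plus one allocation per element); the second is the single contiguous array being asked about.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for PartitionTupleRoutingInfo; only relid is modeled. */
typedef struct RoutingInfo
{
    unsigned int relid;
} RoutingInfo;

/* Layout as in the patch: a pointer array, plus one allocation per entry. */
RoutingInfo **
make_pointer_array(const unsigned int *relids, int n)
{
    RoutingInfo **infos;
    int         i;

    assert(n > 0);
    infos = calloc((size_t) n, sizeof(RoutingInfo *));
    assert(infos != NULL);
    for (i = 0; i < n; i++)
    {
        infos[i] = calloc(1, sizeof(RoutingInfo));
        assert(infos[i] != NULL);
        infos[i]->relid = relids[i];
    }
    return infos;
}

/* Alternative: one contiguous allocation holding the structs themselves. */
RoutingInfo *
make_struct_array(const unsigned int *relids, int n)
{
    RoutingInfo *infos;
    int         i;

    assert(n > 0);
    infos = calloc((size_t) n, sizeof(RoutingInfo));
    assert(infos != NULL);
    for (i = 0; i < n; i++)
        infos[i].relid = relids[i];
    return infos;
}
```

The struct array needs one allocation instead of n + 1 and is freed with a single call; the pointer array lets each element be handed around or freed on its own.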
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
+ * Close all the leaf partitions and their indices.
*
Above comment needs to be shifted a bit down to the subsequent "for"
loop where it's actually applicable.
* node->mt_partition_dispatch_info[0] corresponds to the root partitioned
* table, for which we didn't create tupslot.
Above : node->mt_partition_dispatch_info[0] => node->mt_ptrinfos[0]
/*
* XXX- do we need a pinning mechanism for partition descriptors
* so that there references can be managed independently of
* the parent relcache entry? Like PinPartitionDesc(partdesc)?
*/
pd->partdesc = partdesc;
Any idea if the above can be handled? I am not too sure.
Thanks,
Amit
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Aug 17, 2017 at 12:59 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
Attached rest of the patches. 0001 has changes per Ashutosh's review
comments [2]:
0001: Relieve RelationGetPartitionDispatchInfo() of doing any locking
[2] had a patch with some changes to the original patch you posted. I
didn't describe those changes in my mail, since they rearranged the
comments. Those changes are not part of this patch, and you haven't
commented on those changes either. If you have intentionally
excluded those changes, it's fine. In case you haven't reviewed them,
please see if they are good to be incorporated.
0002: Teach expand_inherited_rtentry to use partition bound order
0004 in [1] expands a multi-level partition hierarchy into similar
inheritance hierarchy. That patch doesn't need all OIDs in one go. It
will have to handle the partition hierarchy level by level, so most of
the code added by this patch will need to be changed by that patch. Is
there a way we can somehow change this patch so that the delta in 0004
is reduced? That may need rethinking about using
RelationGetPartitionDispatchInfo().
[1]: /messages/by-id/CAFjFpRfkr7igCGBBWH1PQ__W-XPy1O79Y-qxCmJc6FizLqFz7Q@mail.gmail.com
[2]: /messages/by-id/CAFjFpRdXn7w7nVKv4qN9fa+tzRi_mJFNCsBWy=bd0SLbYczUfA@mail.gmail.com
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Thu, Aug 17, 2017 at 8:39 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
[2] had a patch with some changes to the original patch you posted. I
didn't describe those changes in my mail, since they rearranged the
comments. Those changes are not part of this patch and you haven't
comments about those changes as well. If you have intentionally
excluded those changes, it's fine. In case, you haven't reviewed them,
please see if they are good to be incorporated.
I took a quick look at your version but I think I like Amit's fine the
way it is, so committed that and back-patched it to v10.
I find 0002 pretty ugly as things stand. We get a bunch of tuple maps
that we don't really need, only to turn around and free them. We get
a bunch of tuple slots that we don't need, only to turn around and
drop them. We don't really need the PartitionDispatch objects either,
except for the OIDs they contain. There's a lot of extra stuff being
computed here that is really irrelevant for this purpose. I think we
should try to clean that up somehow.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi Ashutosh,
On 2017/08/17 21:39, Ashutosh Bapat wrote:
On Thu, Aug 17, 2017 at 12:59 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
Attached rest of the patches. 0001 has changes per Ashutosh's review
comments [2]:
0001: Relieve RelationGetPartitionDispatchInfo() of doing any locking
[2] had a patch with some changes to the original patch you posted. I
didn't describe those changes in my mail, since they rearranged the
comments. Those changes are not part of this patch and you haven't
comments about those changes as well. If you have intentionally
excluded those changes, it's fine. In case, you haven't reviewed them,
please see if they are good to be incorporated.
Sorry, I thought the ones you mentioned in the email were the only changes
you made to the original patch. I noted only those and included them when
editing the relevant commit in my local repository in an interactive
rebase session. I didn't actually take your patch and try to merge it
with the commit in my local repository. IMHO, simply commenting in the
email which parts of the patch you would like to see changed would be
helpful. Then we can discuss those changes and proceed with them (or not)
per the result of that discussion.
0002: Teach expand_inherited_rtentry to use partition bound order
0004 in [1] expands a multi-level partition hierarchy into similar
inheritance hierarchy. That patch doesn't need all OIDs in one go. It
will have to handle the partition hierarchy level by level, so most of
the code added by this patch will need to be changed by that patch. Is
there a way we can somehow change this patch so that the delta in 0004
is reduced? That may need rethinking about using
RelationGetPartitionDispatchInfo().
Regarding that, I have a question:
Does the multi-level partition-wise join planning logic depend on the
inheritance itself being expanded in a multi-level-aware manner? That is,
on expanding the partitioned table inheritance in a multi-level-aware manner in
expand_inherited_rtentry()?
Wouldn't it suffice to just have the resulting Append paths be nested per
multi-level partitioning hierarchy? Creating such nested Append paths
doesn't necessarily require that the inheritance be expanded that way in
the first place (as I am finding out when working on another patch).
Thanks,
Amit
On Fri, Aug 18, 2017 at 10:12 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
0002: Teach expand_inherited_rtentry to use partition bound order
0004 in [1] expands a multi-level partition hierarchy into similar
inheritance hierarchy. That patch doesn't need all OIDs in one go. It
will have to handle the partition hierarchy level by level, so most of
the code added by this patch will need to be changed by that patch. Is
there a way we can somehow change this patch so that the delta in 0004
is reduced? That may need rethinking about using
RelationGetPartitionDispatchInfo().
Regarding that, I have a question:
Does the multi-level partition-wise join planning logic depend on the
inheritance itself to be expanded in a multi-level aware manner. That is,
expanding the partitioned table inheritance in multi-level aware manner in
expand_inherited_rtentry()?
Yes, it needs AppendRelInfos to retain the parent-child relationship.
Please refer to [1], [2], [3] for details.
Wouldn't it suffice to just have the resulting Append paths be nested per
multi-level partitioning hierarchy?
We are joining RelOptInfos, so those need to be nested. For those to
be nested, we need AppendRelInfos to preserve parent-child
relationship. Nesting paths doesn't help. Append paths should actually
be flattened out to avoid the extra time consumed by nested Append
nodes.
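To make the flattening point concrete, here is a small sketch on a toy plan tree. Nothing here is planner code: PlanNode and flatten_append() are hypothetical, and a node is modeled as either a leaf scan or an Append over a fixed-size child array. The function simply collects the leaf scans left to right so that a single Append could be built over them.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 8

/* Hypothetical toy plan node: either a leaf scan or an Append over children. */
typedef struct PlanNode
{
    int         is_append;      /* 1 for Append, 0 for a leaf scan */
    int         scan_id;        /* meaningful for leaf scans only */
    int         nchildren;      /* meaningful for Append only */
    struct PlanNode *children[MAX_CHILDREN];
} PlanNode;

/*
 * Append the leaf scans found under 'node' to out[], left to right,
 * skipping over intermediate Append nodes.  'nout' is the number of
 * entries already in out[]; the updated count is returned.
 */
int
flatten_append(const PlanNode *node, const PlanNode **out, int nout,
               int maxout)
{
    int         i;

    if (!node->is_append)
    {
        assert(nout < maxout);
        out[nout++] = node;
        return nout;
    }

    for (i = 0; i < node->nchildren; i++)
        nout = flatten_append(node->children[i], out, nout, maxout);

    return nout;
}
```

Note that this only flattens the paths; the parent-child relationship between the relations themselves would still have to be tracked separately, which is the part the AppendRelInfos are needed for.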
[1]: /messages/by-id/CAFjFpRceMmx26653XFAYvc5KVQcrzcKScVFqZdbXV=kB8Akkqg@mail.gmail.com
[2]: /messages/by-id/CAFjFpRefs5ZMnxQ2vP9v5zOtWtNPuiMYc01sb1SWjCOB1CT=uQ@mail.gmail.com
[3]: /messages/by-id/CAFjFpRd6Kzx6Xn=7vdwwnh6rEw2VEgo--iPdhV+Fb7bHfPzsbw@mail.gmail.com
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 18 August 2017 at 01:24, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Aug 17, 2017 at 8:39 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
[2] had a patch with some changes to the original patch you posted. I
didn't describe those changes in my mail, since they rearranged the
comments. Those changes are not part of this patch and you haven't
comments about those changes as well. If you have intentionally
excluded those changes, it's fine. In case, you haven't reviewed them,
please see if they are good to be incorporated.
I took a quick look at your version but I think I like Amit's fine the
way it is, so committed that and back-patched it to v10.
I find 0002 pretty ugly as things stand. We get a bunch of tuple maps
that we don't really need, only to turn around and free them. We get
a bunch of tuple slots that we don't need, only to turn around and
drop them.
I think in the final changes after applying all 3 patches, the
redundant tuple slot is no longer present. But ...
We don't really need the PartitionDispatch objects either,
except for the OIDs they contain. There's a lot of extra stuff being
computed here that is really irrelevant for this purpose. I think we
should try to clean that up somehow.
... I am of the same opinion. That's why - as I mentioned upthread - I
was thinking why not have a separate light-weight function to just
generate oids, and keep RelationGetPartitionDispatchInfo() intact, to
be used only for tuple routing.
But, I haven't yet checked Ashutosh's requirements, which suggest
that it does not help to even get the oid list.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On Fri, Aug 18, 2017 at 10:32 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
I think in the final changes after applying all 3 patches, the
redundant tuple slot is no longer present. But ...
We don't really need the PartitionDispatch objects either,
except for the OIDs they contain. There's a lot of extra stuff being
computed here that is really irrelevant for this purpose. I think we
should try to clean that up somehow.
... I am of the same opinion. That's why - as I mentioned upthread - I
was thinking why not have a separate light-weight function to just
generate oids, and keep RelationGetPartitionDispatchInfo() intact, to
be used only for tuple routing.
But, I haven't yet checked Ashutosh's requirements, which suggest
that it does not help to even get the oid list.
The 0004 patch in the partition-wise join patch set has code to expand the
partition hierarchy. That patch expands the inheritance hierarchy in
depth-first order. Robert commented that, instead of depth-first order,
it would be better to expand it in partitioned-tables-first order. With
the latest changes in your patch set, I don't see the reason for
expanding in partitioned-tables-first order. Can you please elaborate on
whether we still need to expand in that order? Maybe we should just
address the expansion issue in 0004 instead of dividing it into two
patches.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 2017/08/18 4:54, Robert Haas wrote:
On Thu, Aug 17, 2017 at 8:39 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
[2] had a patch with some changes to the original patch you posted. I
didn't describe those changes in my mail, since they rearranged the
comments. Those changes are not part of this patch and you haven't
comments about those changes as well. If you have intentionally
excluded those changes, it's fine. In case, you haven't reviewed them,
please see if they are good to be incorporated.
I took a quick look at your version but I think I like Amit's fine the
way it is, so committed that and back-patched it to v10.
Thanks for committing.
I find 0002 pretty ugly as things stand. We get a bunch of tuple maps
that we don't really need, only to turn around and free them. We get
a bunch of tuple slots that we don't need, only to turn around and
drop them. We don't really need the PartitionDispatch objects either,
except for the OIDs they contain. There's a lot of extra stuff being
computed here that is really irrelevant for this purpose. I think we
should try to clean that up somehow.
One way to do that might be to reverse the order of the remaining patches
and put the patch to refactor RelationGetPartitionDispatchInfo() first.
With that refactoring, PartitionDispatch itself has become much simpler in
that it does not contain a relcache reference to be closed eventually by
the caller, the tuple map, and the tuple table slot. Since those things
are required for tuple-routing, the refactoring makes
ExecSetupPartitionTupleRouting itself create them from the (minimal)
information returned by RelationGetPartitionDispatchInfo and ultimately
destroy them when done using them. I kept the indexes field in
PartitionDispatchData though, because it's essentially free to create
while we are walking the partition tree in
RelationGetPartitionDispatchInfo() and it seems undesirable to make the
caller compute that information (indexes) by traversing the partition tree
all over again, if it doesn't otherwise have to. I am still considering
some counter-arguments raised by Amit Khandekar about this last assertion.
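To illustrate why the indexes come out of the tree walk almost for free,
here is a minimal, self-contained sketch of one breadth-first pass that
assigns them. It is only a toy model and not the code in the patch:
ToyRel, ToyDispatch, toy_get_dispatch_info() and the fixed-size arrays are
hypothetical stand-ins for relcache entries, PartitionDispatchData and
RelationGetPartitionDispatchInfo(), and a table with zero partitions is
assumed to be a leaf.

```c
#include <assert.h>

#define TOY_MAX_RELS  16    /* assumed upper bound on tables in the tree */
#define TOY_MAX_PARTS 4     /* assumed upper bound on partitions per table */

/* Hypothetical stand-in for a table: nparts == 0 means "leaf partition". */
typedef struct ToyRel
{
    int nparts;
    int parts[TOY_MAX_PARTS];   /* ids of partitions, in bound order */
} ToyRel;

/* Hypothetical stand-in for PartitionDispatchData (indexes only). */
typedef struct ToyDispatch
{
    int relid;
    int nparts;
    int indexes[TOY_MAX_PARTS];
} ToyDispatch;

/*
 * Walk the tree rooted at 'root' breadth-first.  Fills pds[] with one entry
 * per partitioned table (root first) and leaf_ids[] with the leaf ids.
 * Returns the number of partitioned tables found.
 */
static int
toy_get_dispatch_info(const ToyRel *rels, int root,
                      ToyDispatch *pds, int *leaf_ids, int *nleaves)
{
    int queue[TOY_MAX_RELS];
    int head = 0, tail = 0;
    int nparted = 0;
    int offset = 0;     /* non-root partitioned tables numbered so far */

    *nleaves = 0;
    queue[tail++] = root;
    while (head < tail)
    {
        int          relid = queue[head++];
        const ToyRel *rel = &rels[relid];
        ToyDispatch *pd;
        int          j, k = 0;

        if (rel->nparts == 0)
            continue;   /* leaf: already recorded while visiting its parent */

        pd = &pds[nparted++];
        pd->relid = relid;
        pd->nparts = rel->nparts;
        for (j = 0; j < rel->nparts; j++)
        {
            int child = rel->parts[j];

            assert(tail < TOY_MAX_RELS);
            queue[tail++] = child;      /* visit later, in bound order */
            if (rels[child].nparts == 0)
            {
                /* Leaf: store its zero-based position in leaf_ids[]. */
                leaf_ids[*nleaves] = child;
                pd->indexes[j] = (*nleaves)++;
            }
            else
            {
                /* Partitioned: store its position in pds[], negated. */
                pd->indexes[j] = -(1 + offset + k);
                k++;
            }
        }
        offset += k;
    }
    assert(offset + 1 == nparted);
    return nparted;
}
```

The point of the sketch is that both the leaf list and the indexes fall
out of the same loop, so a caller that wants the indexes does not need a
second traversal.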
Thoughts?
Thanks,
Amit
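For reference, the way tuple routing consumes the indexes array can be
sketched roughly as below. It mirrors the tail of the loop in
get_partition_for_tuple() in the attached patch (a non-negative value is a
leaf index, a negative value is the negated position of the next
partitioned table), but it is a simplified sketch under assumptions:
ToyDispatch and toy_find_leaf() are hypothetical names, and picks[] is a
hypothetical stand-in for the partition-bound search that the real code
performs at each level.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PARTS 4     /* assumed upper bound on partitions per table */

/* Hypothetical stand-in for PartitionDispatchData (indexes only). */
typedef struct ToyDispatch
{
    int nparts;
    int indexes[TOY_MAX_PARTS];
} ToyDispatch;

/*
 * Descend from the root (pds[0]).  picks[level] is the position, in bound
 * order, of the partition the tuple falls into at that level.  Returns the
 * leaf partition's index, or -1 if no leaf was reached.
 */
static int
toy_find_leaf(const ToyDispatch *pds, const int *picks, int npicks)
{
    const ToyDispatch *parent;
    int         level;

    assert(pds != NULL);
    parent = &pds[0];
    for (level = 0; level < npicks; level++)
    {
        int cur_index = picks[level];

        if (cur_index < 0 || cur_index >= parent->nparts)
            return -1;      /* no partition accepts the tuple */
        if (parent->indexes[cur_index] >= 0)
            return parent->indexes[cur_index];      /* found a leaf */

        /* Negative value: continue the search one level down. */
        parent = &pds[-parent->indexes[cur_index]];
    }
    return -1;              /* ran out of picks before reaching a leaf */
}
```

With a root whose indexes are {0, -1, 1} and a sub-partitioned table whose
indexes are {2, 3}, the picks {1, 1} lead to leaf index 3.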
Attachments:
0001-Decouple-RelationGetPartitionDispatchInfo-from-execu.patch (text/plain; charset=UTF-8)
From b86fd0e920e4bc49b17a2dba9e848420ec99c22b Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Mon, 24 Jul 2017 18:59:57 +0900
Subject: [PATCH 1/2] Decouple RelationGetPartitionDispatchInfo() from executor
Currently, it and the structure it generates, viz. PartitionDispatch
objects, are too tightly coupled with the executor's tuple-routing code.
In particular, it's pretty undesirable that it makes the caller
responsible for releasing some resources, such as relcache references
and tuple table slots. That makes it harder to use in places other
than where it's currently being used.
After this refactoring, ExecSetupPartitionTupleRouting() now needs to
do some of the work that was previously done in
RelationGetPartitionDispatchInfo(), and expand_inherited_rtentry() no
longer needs to do some things that it used to.
---
src/backend/catalog/partition.c | 309 +++++++++++++++++----------------
src/backend/commands/copy.c | 35 ++--
src/backend/executor/execMain.c | 145 ++++++++++++++--
src/backend/executor/nodeModifyTable.c | 29 ++--
src/include/catalog/partition.h | 52 +++---
src/include/executor/executor.h | 4 +-
src/include/nodes/execnodes.h | 53 +++++-
7 files changed, 391 insertions(+), 236 deletions(-)
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 96a64ce6b2..7618e4cb31 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -105,6 +105,24 @@ typedef struct PartitionRangeBound
bool lower; /* this is the lower (vs upper) bound */
} PartitionRangeBound;
+/*-----------------------
+ * PartitionDispatchData - information of partitions of one partitioned table
+ * in a partition tree
+ *
+ * partkey Partition key of the table
+ * partdesc Partition descriptor of the table
+ * indexes Array with partdesc->nparts members (for details on what the
+ * individual value represents, see the comments in
+ * RelationGetPartitionDispatchInfo())
+ *-----------------------
+ */
+typedef struct PartitionDispatchData
+{
+ PartitionKey partkey; /* Points into the table's relcache entry */
+ PartitionDesc partdesc; /* Ditto */
+ int *indexes;
+} PartitionDispatchData;
+
static int32 qsort_partition_list_value_cmp(const void *a, const void *b,
void *arg);
static int32 qsort_partition_rbound_cmp(const void *a, const void *b,
@@ -981,181 +999,165 @@ get_partition_qual_relid(Oid relid)
}
/*
- * Append OIDs of rel's partitions to the list 'partoids' and for each OID,
- * append pointer rel to the list 'parents'.
- */
-#define APPEND_REL_PARTITION_OIDS(rel, partoids, parents) \
- do\
- {\
- int i;\
- for (i = 0; i < (rel)->rd_partdesc->nparts; i++)\
- {\
- (partoids) = lappend_oid((partoids), (rel)->rd_partdesc->oids[i]);\
- (parents) = lappend((parents), (rel));\
- }\
- } while(0)
-
-/*
* RelationGetPartitionDispatchInfo
- * Returns information necessary to route tuples down a partition tree
+ * Returns necessary information for each partition in the partition
+ * tree rooted at rel
*
- * The number of elements in the returned array (that is, the number of
- * PartitionDispatch objects for the partitioned tables in the partition tree)
- * is returned in *num_parted and a list of the OIDs of all the leaf
- * partitions of rel is returned in *leaf_part_oids.
+ * Information returned includes the following: *ptinfos contains a list of
+ * PartitionedTableInfo objects, one for each partitioned table (with at least
+ * one member, that is, one for the root partitioned table), *leaf_part_oids
+ * contains a list of the OIDs of all the leaf partitions.
*
- * All the relations in the partition tree (including 'rel') must have been
- * locked (using at least the AccessShareLock) by the caller.
+ * We require that the caller has locked at least the partitioned tables in the
+ * partition tree (including 'rel') using at least the AccessShareLock,
+ * because we need to look at their relcache entries to get PartitionKey and
+ * PartitionDesc.
*/
-PartitionDispatch *
+void
RelationGetPartitionDispatchInfo(Relation rel,
- int *num_parted, List **leaf_part_oids)
+ List **ptinfos, List **leaf_part_oids)
{
- PartitionDispatchData **pd;
- List *all_parts = NIL,
- *all_parents = NIL,
- *parted_rels,
- *parted_rel_parents;
+ List *all_parts,
+ *all_parents;
ListCell *lc1,
*lc2;
int i,
- k,
offset;
/*
* We rely on the relcache to traverse the partition tree to build both
- * the leaf partition OIDs list and the array of PartitionDispatch objects
- * for the partitioned tables in the tree. That means every partitioned
- * table in the tree must be locked, which is fine since we require the
- * caller to lock all the partitions anyway.
+ * the leaf partition OIDs list and the list of PartitionedTableInfo
+ * objects for partitioned tables. That means every partitioned table in
+ * the tree must be locked, which is fine since the callers must have done
+ * that already.
*
* For every partitioned table in the tree, starting with the root
* partitioned table, add its relcache entry to parted_rels, while also
* queuing its partitions (in the order in which they appear in the
* partition descriptor) to be looked at later in the same loop. This is
* a bit tricky but works because the foreach() macro doesn't fetch the
- * next list element until the bottom of the loop.
+ * next list element until the bottom of the loop. Non-partitioned tables
+ * are simply added to the leaf partitions list.
*/
- *num_parted = 1;
- parted_rels = list_make1(rel);
- /* Root partitioned table has no parent, so NULL for parent */
- parted_rel_parents = list_make1(NULL);
- APPEND_REL_PARTITION_OIDS(rel, all_parts, all_parents);
+ i = offset = 0;
+ *ptinfos = *leaf_part_oids = NIL;
+
+ /* Start with the root table. */
+ all_parts = list_make1_oid(RelationGetRelid(rel));
+ all_parents = list_make1_oid(InvalidOid);
forboth(lc1, all_parts, lc2, all_parents)
{
- Oid partrelid = lfirst_oid(lc1);
- Relation parent = lfirst(lc2);
+ Oid partrelid = lfirst_oid(lc1);
+ Oid parentrelid = lfirst_oid(lc2);
if (get_rel_relkind(partrelid) == RELKIND_PARTITIONED_TABLE)
{
- /*
- * Already locked by the caller. Note that it is the
- * responsibility of the caller to close the below relcache entry,
- * once done using the information being collected here (for
- * example, in ExecEndModifyTable).
- */
- Relation partrel = heap_open(partrelid, NoLock);
+ int j,
+ k;
+ Relation partrel;
+ PartitionKey partkey;
+ PartitionDesc partdesc;
+ PartitionedTableInfo *ptinfo;
+ PartitionDispatch pd;
+
+ if (partrelid != RelationGetRelid(rel))
+ partrel = heap_open(partrelid, NoLock);
+ else
+ partrel = rel;
- (*num_parted)++;
- parted_rels = lappend(parted_rels, partrel);
- parted_rel_parents = lappend(parted_rel_parents, parent);
- APPEND_REL_PARTITION_OIDS(partrel, all_parts, all_parents);
- }
- }
+ partkey = RelationGetPartitionKey(partrel);
+ partdesc = RelationGetPartitionDesc(partrel);
+
+ ptinfo = (PartitionedTableInfo *)
+ palloc0(sizeof(PartitionedTableInfo));
+ ptinfo->relid = partrelid;
+ ptinfo->parentid = parentrelid;
+
+ ptinfo->pd = pd = (PartitionDispatchData *)
+ palloc0(sizeof(PartitionDispatchData));
+ pd->partkey = partkey;
- /*
- * We want to create two arrays - one for leaf partitions and another for
- * partitioned tables (including the root table and internal partitions).
- * While we only create the latter here, leaf partition array of suitable
- * objects (such as, ResultRelInfo) is created by the caller using the
- * list of OIDs we return. Indexes into these arrays get assigned in a
- * breadth-first manner, whereby partitions of any given level are placed
- * consecutively in the respective arrays.
- */
- pd = (PartitionDispatchData **) palloc(*num_parted *
- sizeof(PartitionDispatchData *));
- *leaf_part_oids = NIL;
- i = k = offset = 0;
- forboth(lc1, parted_rels, lc2, parted_rel_parents)
- {
- Relation partrel = lfirst(lc1);
- Relation parent = lfirst(lc2);
- PartitionKey partkey = RelationGetPartitionKey(partrel);
- TupleDesc tupdesc = RelationGetDescr(partrel);
- PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
- int j,
- m;
-
- pd[i] = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
- pd[i]->reldesc = partrel;
- pd[i]->key = partkey;
- pd[i]->keystate = NIL;
- pd[i]->partdesc = partdesc;
- if (parent != NULL)
- {
/*
- * For every partitioned table other than root, we must store a
- * tuple table slot initialized with its tuple descriptor and a
- * tuple conversion map to convert a tuple from its parent's
- * rowtype to its own. That is to make sure that we are looking at
- * the correct row using the correct tuple descriptor when
- * computing its partition key for tuple routing.
+ * XXX- do we need a pinning mechanism for partition descriptors
+ * so that their references can be managed independently of
+ * the parent relcache entry? Like PinPartitionDesc(partdesc)?
*/
- pd[i]->tupslot = MakeSingleTupleTableSlot(tupdesc);
- pd[i]->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
- tupdesc,
- gettext_noop("could not convert row type"));
- }
- else
- {
- /* Not required for the root partitioned table */
- pd[i]->tupslot = NULL;
- pd[i]->tupmap = NULL;
- }
- pd[i]->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
+ pd->partdesc = partdesc;
- /*
- * Indexes corresponding to the internal partitions are multiplied by
- * -1 to distinguish them from those of leaf partitions. Encountering
- * an index >= 0 means we found a leaf partition, which is immediately
- * returned as the partition we are looking for. A negative index
- * means we found a partitioned table, whose PartitionDispatch object
- * is located at the above index multiplied back by -1. Using the
- * PartitionDispatch object, search is continued further down the
- * partition tree.
- */
- m = 0;
- for (j = 0; j < partdesc->nparts; j++)
- {
- Oid partrelid = partdesc->oids[j];
+ /*
+ * The values contained in the following array correspond to
+ * indexes of this table's partitions in the global sequence of
+ * all the partitions contained in the partition tree rooted at
+ * rel, traversed in a breadth-first manner. The values should be
+ * such that we will be able to distinguish the leaf partitions
+ * from the non-leaf partitions, because they are returned to
+ * the caller in separate structures from where they will be
+ * accessed. The way that's done is described below:
+ *
+ * Leaf partition OIDs are put into the global leaf_part_oids list,
+ * and for each one, the value stored is its ordinal position in
+ * the list minus 1.
+ *
+ * PartitionedTableInfo objects corresponding to partitions that
+ * are partitioned tables are put into the global ptinfos list,
+ * and for each one, the value stored is its ordinal position in
+ * the list minus 1, multiplied by -1.
+ *
+ * So while looking at the values in the indexes array, if one
+ * gets zero or a positive value, then it's a leaf partition;
+ * otherwise, it's a partitioned table.
+ */
+ pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
- if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
- {
- *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
- pd[i]->indexes[j] = k++;
- }
- else
+ k = 0;
+ for (j = 0; j < partdesc->nparts; j++)
{
+ Oid partrelid = partdesc->oids[j];
+
/*
- * offset denotes the number of partitioned tables of upper
- * levels including those of the current level. Any partition
- * of this table must belong to the next level and hence will
- * be placed after the last partitioned table of this level.
+ * Queue this partition so that it will be processed later
+ * by the outer loop.
*/
- pd[i]->indexes[j] = -(1 + offset + m);
- m++;
+ all_parts = lappend_oid(all_parts, partrelid);
+ all_parents = lappend_oid(all_parents,
+ RelationGetRelid(partrel));
+
+ if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
+ {
+ *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
+ pd->indexes[j] = i++;
+ }
+ else
+ {
+ /*
+ * offset denotes the number of partitioned tables that
+ * we have already processed. k counts the number of
+ * partitions of this table that were found to be
+ * partitioned tables.
+ */
+ pd->indexes[j] = -(1 + offset + k);
+ k++;
+ }
}
- }
- i++;
- /*
- * This counts the number of partitioned tables at upper levels
- * including those of the current level.
- */
- offset += m;
+ offset += k;
+
+ /*
+ * Release the relation descriptor. The lock that we hold on the
+ * table will keep the PartitionDesc that is pointing into
+ * RelationData intact, a pointer to which we hope to keep
+ * through this transaction's commit.
+ * (XXX - how true is that?)
+ */
+ if (partrel != rel)
+ heap_close(partrel, NoLock);
+
+ *ptinfos = lappend(*ptinfos, ptinfo);
+ }
}
- return pd;
+ Assert(i == list_length(*leaf_part_oids));
+ Assert((offset + 1) == list_length(*ptinfos));
}
/* Module-local functions */
@@ -1872,7 +1874,7 @@ generate_partition_qual(Relation rel)
* ----------------
*/
void
-FormPartitionKeyDatum(PartitionDispatch pd,
+FormPartitionKeyDatum(PartitionKeyInfo *keyinfo,
TupleTableSlot *slot,
EState *estate,
Datum *values,
@@ -1881,20 +1883,21 @@ FormPartitionKeyDatum(PartitionDispatch pd,
ListCell *partexpr_item;
int i;
- if (pd->key->partexprs != NIL && pd->keystate == NIL)
+ if (keyinfo->key->partexprs != NIL && keyinfo->keystate == NIL)
{
/* Check caller has set up context correctly */
Assert(estate != NULL &&
GetPerTupleExprContext(estate)->ecxt_scantuple == slot);
/* First time through, set up expression evaluation state */
- pd->keystate = ExecPrepareExprList(pd->key->partexprs, estate);
+ keyinfo->keystate = ExecPrepareExprList(keyinfo->key->partexprs,
+ estate);
}
- partexpr_item = list_head(pd->keystate);
- for (i = 0; i < pd->key->partnatts; i++)
+ partexpr_item = list_head(keyinfo->keystate);
+ for (i = 0; i < keyinfo->key->partnatts; i++)
{
- AttrNumber keycol = pd->key->partattrs[i];
+ AttrNumber keycol = keyinfo->key->partattrs[i];
Datum datum;
bool isNull;
@@ -1931,13 +1934,13 @@ FormPartitionKeyDatum(PartitionDispatch pd,
* the latter case.
*/
int
-get_partition_for_tuple(PartitionDispatch *pd,
+get_partition_for_tuple(PartitionTupleRoutingInfo **ptrinfos,
TupleTableSlot *slot,
EState *estate,
- PartitionDispatchData **failed_at,
+ PartitionTupleRoutingInfo **failed_at,
TupleTableSlot **failed_slot)
{
- PartitionDispatch parent;
+ PartitionTupleRoutingInfo *parent;
Datum values[PARTITION_MAX_KEYS];
bool isnull[PARTITION_MAX_KEYS];
int cur_offset,
@@ -1948,11 +1951,11 @@ get_partition_for_tuple(PartitionDispatch *pd,
TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
/* start with the root partitioned table */
- parent = pd[0];
+ parent = ptrinfos[0];
while (true)
{
- PartitionKey key = parent->key;
- PartitionDesc partdesc = parent->partdesc;
+ PartitionKey key = parent->pd->partkey;
+ PartitionDesc partdesc = parent->pd->partdesc;
TupleTableSlot *myslot = parent->tupslot;
TupleConversionMap *map = parent->tupmap;
@@ -1984,7 +1987,7 @@ get_partition_for_tuple(PartitionDispatch *pd,
* So update ecxt_scantuple accordingly.
*/
ecxt->ecxt_scantuple = slot;
- FormPartitionKeyDatum(parent, slot, estate, values, isnull);
+ FormPartitionKeyDatum(parent->keyinfo, slot, estate, values, isnull);
if (key->strategy == PARTITION_STRATEGY_RANGE)
{
@@ -2055,13 +2058,13 @@ get_partition_for_tuple(PartitionDispatch *pd,
*failed_slot = slot;
break;
}
- else if (parent->indexes[cur_index] >= 0)
+ else if (parent->pd->indexes[cur_index] >= 0)
{
- result = parent->indexes[cur_index];
+ result = parent->pd->indexes[cur_index];
break;
}
else
- parent = pd[-parent->indexes[cur_index]];
+ parent = ptrinfos[-parent->pd->indexes[cur_index]];
}
error_exit:
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index a258965c20..e17a339349 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -165,8 +165,8 @@ typedef struct CopyStateData
bool volatile_defexprs; /* is any of defexprs volatile? */
List *range_table;
- PartitionDispatch *partition_dispatch_info;
- int num_dispatch; /* Number of entries in the above array */
+ PartitionTupleRoutingInfo **ptrinfos;
+ int num_parted; /* Number of entries in the above array */
int num_partitions; /* Number of members in the following arrays */
ResultRelInfo *partitions; /* Per partition result relation */
TupleConversionMap **partition_tupconv_maps;
@@ -1425,7 +1425,7 @@ BeginCopy(ParseState *pstate,
/* Initialize state for CopyFrom tuple routing. */
if (is_from && rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- PartitionDispatch *partition_dispatch_info;
+ PartitionTupleRoutingInfo **ptrinfos;
ResultRelInfo *partitions;
TupleConversionMap **partition_tupconv_maps;
TupleTableSlot *partition_tuple_slot;
@@ -1434,13 +1434,13 @@ BeginCopy(ParseState *pstate,
ExecSetupPartitionTupleRouting(rel,
1,
- &partition_dispatch_info,
+ &ptrinfos,
&partitions,
&partition_tupconv_maps,
&partition_tuple_slot,
&num_parted, &num_partitions);
- cstate->partition_dispatch_info = partition_dispatch_info;
- cstate->num_dispatch = num_parted;
+ cstate->ptrinfos = ptrinfos;
+ cstate->num_parted = num_parted;
cstate->partitions = partitions;
cstate->num_partitions = num_partitions;
cstate->partition_tupconv_maps = partition_tupconv_maps;
@@ -2495,7 +2495,7 @@ CopyFrom(CopyState cstate)
if ((resultRelInfo->ri_TrigDesc != NULL &&
(resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
- cstate->partition_dispatch_info != NULL ||
+ cstate->ptrinfos != NULL ||
cstate->volatile_defexprs)
{
useHeapMultiInsert = false;
@@ -2573,7 +2573,7 @@ CopyFrom(CopyState cstate)
ExecStoreTuple(tuple, slot, InvalidBuffer, false);
/* Determine the partition to heap_insert the tuple into */
- if (cstate->partition_dispatch_info)
+ if (cstate->ptrinfos)
{
int leaf_part_index;
TupleConversionMap *map;
@@ -2587,7 +2587,7 @@ CopyFrom(CopyState cstate)
* partition, respectively.
*/
leaf_part_index = ExecFindPartition(resultRelInfo,
- cstate->partition_dispatch_info,
+ cstate->ptrinfos,
slot,
estate);
Assert(leaf_part_index >= 0 &&
@@ -2819,23 +2819,20 @@ CopyFrom(CopyState cstate)
ExecCloseIndices(resultRelInfo);
- /* Close all the partitioned tables, leaf partitions, and their indices */
- if (cstate->partition_dispatch_info)
+ /* Close all the leaf partitions and their indices */
+ if (cstate->ptrinfos)
{
int i;
/*
- * Remember cstate->partition_dispatch_info[0] corresponds to the root
- * partitioned table, which we must not try to close, because it is
- * the main target table of COPY that will be closed eventually by
- * DoCopy(). Also, tupslot is NULL for the root partitioned table.
+ * cstate->ptrinfos[0] corresponds to the root partitioned table, for
+ * which we didn't create tupslot.
*/
- for (i = 1; i < cstate->num_dispatch; i++)
+ for (i = 1; i < cstate->num_parted; i++)
{
- PartitionDispatch pd = cstate->partition_dispatch_info[i];
+ PartitionTupleRoutingInfo *ptrinfo = cstate->ptrinfos[i];
- heap_close(pd->reldesc, NoLock);
- ExecDropSingleTupleTableSlot(pd->tupslot);
+ ExecDropSingleTupleTableSlot(ptrinfo->tupslot);
}
for (i = 0; i < cstate->num_partitions; i++)
{
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 74071eede6..15366fa4cd 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -3215,8 +3215,8 @@ EvalPlanQualEnd(EPQState *epqstate)
* tuple routing for partitioned tables
*
* Output arguments:
- * 'pd' receives an array of PartitionDispatch objects with one entry for
- * every partitioned table in the partition tree
+ * 'ptrinfos' receives an array of PartitionTupleRoutingInfo objects with one
+ * entry for each partitioned table in the partition tree
* 'partitions' receives an array of ResultRelInfo objects with one entry for
* every leaf partition in the partition tree
* 'tup_conv_maps' receives an array of TupleConversionMap objects with one
@@ -3238,7 +3238,7 @@ EvalPlanQualEnd(EPQState *epqstate)
void
ExecSetupPartitionTupleRouting(Relation rel,
Index resultRTindex,
- PartitionDispatch **pd,
+ PartitionTupleRoutingInfo ***ptrinfos,
ResultRelInfo **partitions,
TupleConversionMap ***tup_conv_maps,
TupleTableSlot **partition_tuple_slot,
@@ -3246,16 +3246,135 @@ ExecSetupPartitionTupleRouting(Relation rel,
{
TupleDesc tupDesc = RelationGetDescr(rel);
List *leaf_parts;
+ List *ptinfos = NIL;
ListCell *cell;
int i;
ResultRelInfo *leaf_part_rri;
+ Relation parent;
/*
* Get the information about the partition tree after locking all the
* partitions.
*/
(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
- *pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
+ RelationGetPartitionDispatchInfo(rel, &ptinfos, &leaf_parts);
+
+ /*
+ * The ptinfos list contains PartitionedTableInfo objects for all the
+ * partitioned tables in the partition tree. Using the information
+ * therein, we construct an array of PartitionTupleRoutingInfo objects
+ * to be used during tuple-routing.
+ */
+ *num_parted = list_length(ptinfos);
+ *ptrinfos = (PartitionTupleRoutingInfo **) palloc0(*num_parted *
+ sizeof(PartitionTupleRoutingInfo *));
+ /*
+ * Free the ptinfos List structure itself as we go through (open-coded
+ * list_free).
+ */
+ i = 0;
+ cell = list_head(ptinfos);
+ parent = NULL;
+ while (cell)
+ {
+ ListCell *tmp = cell;
+ PartitionedTableInfo *ptinfo = lfirst(tmp),
+ *next_ptinfo = NULL;
+ Relation partrel;
+ PartitionTupleRoutingInfo *ptrinfo;
+
+ if (lnext(tmp))
+ next_ptinfo = lfirst(lnext(tmp));
+
+ /* As mentioned above, the partitioned tables have been locked. */
+ if (ptinfo->relid != RelationGetRelid(rel))
+ partrel = heap_open(ptinfo->relid, NoLock);
+ else
+ partrel = rel;
+
+ ptrinfo = (PartitionTupleRoutingInfo *)
+ palloc0(sizeof(PartitionTupleRoutingInfo));
+ ptrinfo->relid = ptinfo->relid;
+
+ /* Stash a reference to this PartitionDispatch. */
+ ptrinfo->pd = ptinfo->pd;
+
+ /* State for extracting partition key from tuples will go here. */
+ ptrinfo->keyinfo = (PartitionKeyInfo *)
+ palloc0(sizeof(PartitionKeyInfo));
+ ptrinfo->keyinfo->key = RelationGetPartitionKey(partrel);
+ ptrinfo->keyinfo->keystate = NIL;
+
+ /*
+ * For every partitioned table other than root, we must store a tuple
+ * table slot initialized with its tuple descriptor and a tuple
+ * conversion map to convert a tuple from its parent's rowtype to its
+ * own. That is to make sure that we are looking at the correct row
+ * using the correct tuple descriptor when computing its partition key
+ * for tuple routing.
+ */
+ if (ptinfo->parentid != InvalidOid)
+ {
+ TupleDesc tupdesc = RelationGetDescr(partrel);
+
+ /* Open the parent relation descriptor if not already done. */
+ if (ptinfo->parentid == RelationGetRelid(rel))
+ {
+ parent = rel;
+ }
+ else if (parent == NULL)
+ {
+ /* Already locked by find_all_inheritors() above. */
+ parent = heap_open(ptinfo->parentid, NoLock);
+ }
+
+ ptrinfo->tupslot = MakeSingleTupleTableSlot(tupdesc);
+ ptrinfo->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
+ tupdesc,
+ gettext_noop("could not convert row type"));
+
+ /*
+ * Close the parent descriptor, if the next partitioned table in
+ * the list is not a sibling, because it will have a different
+ * parent if so.
+ */
+ if (parent != NULL && parent != rel &&
+ next_ptinfo != NULL &&
+ next_ptinfo->parentid != ptinfo->parentid)
+ {
+ heap_close(parent, NoLock);
+ parent = NULL;
+ }
+
+ /*
+ * Release the relation descriptor. The lock that we hold on the
+ * table will keep the PartitionDesc that is pointing into
+ * RelationData intact, a pointer to which we hope to keep
+ * through this transaction's commit.
+ * (XXX - how true is that?)
+ */
+ if (partrel != rel)
+ heap_close(partrel, NoLock);
+ }
+ else
+ {
+ /* Not required for the root partitioned table */
+ ptrinfo->tupslot = NULL;
+ ptrinfo->tupmap = NULL;
+ }
+
+ (*ptrinfos)[i++] = ptrinfo;
+
+ /* Free the ListCell. */
+ cell = lnext(cell);
+ pfree(tmp);
+ }
+
+ /* Free the List itself. */
+ if (ptinfos)
+ pfree(ptinfos);
+
+ /* For leaf partitions, we build ResultRelInfos and TupleConversionMaps. */
*num_partitions = list_length(leaf_parts);
*partitions = (ResultRelInfo *) palloc(*num_partitions *
sizeof(ResultRelInfo));
@@ -3282,7 +3401,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
* Note that each of the relations in *partitions are eventually
* closed by the caller.
*/
- partrel = heap_open(lfirst_oid(cell), NoLock);
+ partrel = heap_open(lfirst_oid(cell), RowExclusiveLock);
part_tupdesc = RelationGetDescr(partrel);
/*
@@ -3295,7 +3414,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
* partition from the parent's type to the partition's.
*/
(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
- gettext_noop("could not convert row type"));
+ gettext_noop("could not convert row type"));
InitResultRelInfo(leaf_part_rri,
partrel,
@@ -3329,11 +3448,13 @@ ExecSetupPartitionTupleRouting(Relation rel,
* by get_partition_for_tuple() unchanged.
*/
int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
- TupleTableSlot *slot, EState *estate)
+ExecFindPartition(ResultRelInfo *resultRelInfo,
+ PartitionTupleRoutingInfo **ptrinfos,
+ TupleTableSlot *slot,
+ EState *estate)
{
int result;
- PartitionDispatchData *failed_at;
+ PartitionTupleRoutingInfo *failed_at;
TupleTableSlot *failed_slot;
/*
@@ -3343,7 +3464,7 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
if (resultRelInfo->ri_PartitionCheck)
ExecPartitionCheck(resultRelInfo, slot, estate);
- result = get_partition_for_tuple(pd, slot, estate,
+ result = get_partition_for_tuple(ptrinfos, slot, estate,
&failed_at, &failed_slot);
if (result < 0)
{
@@ -3353,9 +3474,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
char *val_desc;
ExprContext *ecxt = GetPerTupleExprContext(estate);
- failed_rel = failed_at->reldesc;
+ failed_rel = heap_open(failed_at->relid, NoLock);
ecxt->ecxt_scantuple = failed_slot;
- FormPartitionKeyDatum(failed_at, failed_slot, estate,
+ FormPartitionKeyDatum(failed_at->keyinfo, failed_slot, estate,
key_values, key_isnull);
val_desc = ExecBuildSlotPartitionKeyDescription(failed_rel,
key_values,
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 36b2b43bc6..9cf974c938 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -277,7 +277,7 @@ ExecInsert(ModifyTableState *mtstate,
resultRelInfo = estate->es_result_relation_info;
/* Determine the partition to heap_insert the tuple into */
- if (mtstate->mt_partition_dispatch_info)
+ if (mtstate->mt_ptrinfos)
{
int leaf_part_index;
TupleConversionMap *map;
@@ -291,7 +291,7 @@ ExecInsert(ModifyTableState *mtstate,
* respectively.
*/
leaf_part_index = ExecFindPartition(resultRelInfo,
- mtstate->mt_partition_dispatch_info,
+ mtstate->mt_ptrinfos,
slot,
estate);
Assert(leaf_part_index >= 0 &&
@@ -1486,7 +1486,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
int numResultRelInfos;
/* Find the set of partitions so that we can find their TupleDescs. */
- if (mtstate->mt_partition_dispatch_info != NULL)
+ if (mtstate->mt_ptrinfos != NULL)
{
/*
* For INSERT via partitioned table, so we need TupleDescs based
@@ -1910,7 +1910,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
if (operation == CMD_INSERT &&
rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- PartitionDispatch *partition_dispatch_info;
+ PartitionTupleRoutingInfo **ptrinfos;
ResultRelInfo *partitions;
TupleConversionMap **partition_tupconv_maps;
TupleTableSlot *partition_tuple_slot;
@@ -1919,13 +1919,13 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
ExecSetupPartitionTupleRouting(rel,
node->nominalRelation,
- &partition_dispatch_info,
+ &ptrinfos,
&partitions,
&partition_tupconv_maps,
&partition_tuple_slot,
&num_parted, &num_partitions);
- mtstate->mt_partition_dispatch_info = partition_dispatch_info;
- mtstate->mt_num_dispatch = num_parted;
+ mtstate->mt_ptrinfos = ptrinfos;
+ mtstate->mt_num_parted = num_parted;
mtstate->mt_partitions = partitions;
mtstate->mt_num_partitions = num_partitions;
mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
@@ -2335,19 +2335,16 @@ ExecEndModifyTable(ModifyTableState *node)
}
/*
- * Close all the partitioned tables, leaf partitions, and their indices
+ * Close all the leaf partitions and their indices.
*
- * Remember node->mt_partition_dispatch_info[0] corresponds to the root
- * partitioned table, which we must not try to close, because it is the
- * main target table of the query that will be closed by ExecEndPlan().
- * Also, tupslot is NULL for the root partitioned table.
+ * node->mt_ptrinfos[0] corresponds to the root partitioned
+ * table, for which we didn't create tupslot.
*/
- for (i = 1; i < node->mt_num_dispatch; i++)
+ for (i = 1; i < node->mt_num_parted; i++)
{
- PartitionDispatch pd = node->mt_partition_dispatch_info[i];
+ PartitionTupleRoutingInfo *ptrinfo = node->mt_ptrinfos[i];
- heap_close(pd->reldesc, NoLock);
- ExecDropSingleTupleTableSlot(pd->tupslot);
+ ExecDropSingleTupleTableSlot(ptrinfo->tupslot);
}
for (i = 0; i < node->mt_num_partitions; i++)
{
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 2283c675e9..7b53baf847 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -39,36 +39,23 @@ typedef struct PartitionDescData
typedef struct PartitionDescData *PartitionDesc;
-/*-----------------------
- * PartitionDispatch - information about one partitioned table in a partition
- * hierarchy required to route a tuple to one of its partitions
- *
- * reldesc Relation descriptor of the table
- * key Partition key information of the table
- * keystate Execution state required for expressions in the partition key
- * partdesc Partition descriptor of the table
- * tupslot A standalone TupleTableSlot initialized with this table's tuple
- * descriptor
- * tupmap TupleConversionMap to convert from the parent's rowtype to
- * this table's rowtype (when extracting the partition key of a
- * tuple just before routing it through this table)
- * indexes Array with partdesc->nparts members (for details on what
- * individual members represent, see how they are set in
- * RelationGetPartitionDispatchInfo())
- *-----------------------
+typedef struct PartitionDispatchData *PartitionDispatch;
+
+/*
+ * Information about one partitioned table in a given partition tree
*/
-typedef struct PartitionDispatchData
+typedef struct PartitionedTableInfo
{
- Relation reldesc;
- PartitionKey key;
- List *keystate; /* list of ExprState */
- PartitionDesc partdesc;
- TupleTableSlot *tupslot;
- TupleConversionMap *tupmap;
- int *indexes;
-} PartitionDispatchData;
+ Oid relid;
+ Oid parentid;
-typedef struct PartitionDispatchData *PartitionDispatch;
+ /*
+ * This contains information about bounds of the partitions of this
+ * table and about where individual partitions are placed in the global
+ * partition tree.
+ */
+ PartitionDispatch pd;
+} PartitionedTableInfo;
extern void RelationBuildPartitionDesc(Relation relation);
extern bool partition_bounds_equal(int partnatts, int16 *parttyplen,
@@ -86,17 +73,18 @@ extern List *map_partition_varattnos(List *expr, int target_varno,
extern List *RelationGetPartitionQual(Relation rel);
extern Expr *get_partition_qual_relid(Oid relid);
+extern void RelationGetPartitionDispatchInfo(Relation rel,
+ List **ptinfos, List **leaf_part_oids);
+
/* For tuple routing */
-extern PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
- int *num_parted, List **leaf_part_oids);
-extern void FormPartitionKeyDatum(PartitionDispatch pd,
+extern void FormPartitionKeyDatum(PartitionKeyInfo *keyinfo,
TupleTableSlot *slot,
EState *estate,
Datum *values,
bool *isnull);
-extern int get_partition_for_tuple(PartitionDispatch *pd,
+extern int get_partition_for_tuple(PartitionTupleRoutingInfo **ptrinfos,
TupleTableSlot *slot,
EState *estate,
- PartitionDispatchData **failed_at,
+ PartitionTupleRoutingInfo **failed_at,
TupleTableSlot **failed_slot);
#endif /* PARTITION_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 60326f9d03..6e1d3a6d2f 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -208,13 +208,13 @@ extern void EvalPlanQualSetTuple(EPQState *epqstate, Index rti,
extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti);
extern void ExecSetupPartitionTupleRouting(Relation rel,
Index resultRTindex,
- PartitionDispatch **pd,
+ PartitionTupleRoutingInfo ***ptrinfos,
ResultRelInfo **partitions,
TupleConversionMap ***tup_conv_maps,
TupleTableSlot **partition_tuple_slot,
int *num_parted, int *num_partitions);
extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
- PartitionDispatch *pd,
+ PartitionTupleRoutingInfo **ptrinfos,
TupleTableSlot *slot,
EState *estate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 577499465d..07e50e0914 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -414,6 +414,55 @@ typedef struct ResultRelInfo
Relation ri_PartitionRoot;
} ResultRelInfo;
+/* Forward declarations, to avoid including other headers */
+typedef struct PartitionKeyData *PartitionKey;
+typedef struct PartitionDispatchData *PartitionDispatch;
+
+/*
+ * PartitionKeyInfo - execution state for the partition key of a
+ * partitioned table
+ *
+ * keystate is the execution state required for expressions contained in the
+ * partition key. It is NIL until initialized by FormPartitionKeyDatum() if
+ * and when it is called; for example, during tuple routing through a given
+ * partitioned table.
+ */
+typedef struct PartitionKeyInfo
+{
+ PartitionKey key; /* Points into the table's relcache entry */
+ List *keystate;
+} PartitionKeyInfo;
+
+/*
+ * PartitionTupleRoutingInfo - information required for tuple-routing
+ * through one partitioned table in a partition
+ * tree
+ */
+typedef struct PartitionTupleRoutingInfo
+{
+ /* OID of the table */
+ Oid relid;
+
+ /* Information about the table's partitions */
+ PartitionDispatch pd;
+
+ /* See comment above the definition of PartitionKeyInfo */
+ PartitionKeyInfo *keyinfo;
+
+ /*
+ * A standalone TupleTableSlot initialized with this table's tuple
+ * descriptor
+ */
+ TupleTableSlot *tupslot;
+
+ /*
+ * TupleConversionMap to convert from the parent's rowtype to this table's
+ * rowtype (when extracting the partition key of a tuple just before
+ * routing it through this table)
+ */
+ TupleConversionMap *tupmap;
+} PartitionTupleRoutingInfo;
+
/* ----------------
* EState information
*
@@ -970,9 +1019,9 @@ typedef struct ModifyTableState
TupleTableSlot *mt_existing; /* slot to store existing target tuple in */
List *mt_excludedtlist; /* the excluded pseudo relation's tlist */
TupleTableSlot *mt_conflproj; /* CONFLICT ... SET ... projection target */
- struct PartitionDispatchData **mt_partition_dispatch_info;
/* Tuple-routing support info */
- int mt_num_dispatch; /* Number of entries in the above array */
+ struct PartitionTupleRoutingInfo **mt_ptrinfos;
+ int mt_num_parted; /* Number of entries in the above array */
int mt_num_partitions; /* Number of members in the following
* arrays */
ResultRelInfo *mt_partitions; /* Per partition result relation */
--
2.11.0
0002-Teach-expand_inherited_rtentry-to-use-partition-boun.patch (text/plain)
From 593b8dba9e859344bb3402786014201a3fa1e363 Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Wed, 9 Aug 2017 15:52:36 +0900
Subject: [PATCH 2/2] Teach expand_inherited_rtentry to use partition bound
order
After locking the child tables using find_all_inheritors, we discard
the list of child table OIDs that it generates and rebuild the same
using the information returned by RelationGetPartitionDispatchInfo.
---
src/backend/optimizer/prep/prepunion.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index f43c3f3007..dad6892c4d 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -33,6 +33,7 @@
#include "access/heapam.h"
#include "access/htup_details.h"
#include "access/sysattr.h"
+#include "catalog/partition.h"
#include "catalog/pg_inherits_fn.h"
#include "catalog/pg_type.h"
#include "miscadmin.h"
@@ -1452,6 +1453,38 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
*/
oldrelation = heap_open(parentOID, NoLock);
+ /*
+ * For partitioned tables, we arrange the child table OIDs such that they
+ * appear in the partition bound order.
+ */
+ if (rte->relkind == RELKIND_PARTITIONED_TABLE)
+ {
+ List *leaf_part_oids,
+ *ptinfos;
+
+ /* Discard the original list. */
+ list_free(inhOIDs);
+ inhOIDs = NIL;
+
+ /* Request partitioning information. */
+ RelationGetPartitionDispatchInfo(oldrelation, &ptinfos,
+ &leaf_part_oids);
+
+ /*
+ * First collect the partitioned child table OIDs, which includes the
+ * root parent at the head.
+ */
+ foreach(l, ptinfos)
+ {
+ PartitionedTableInfo *ptinfo = lfirst(l);
+
+ inhOIDs = lappend_oid(inhOIDs, ptinfo->relid);
+ }
+
+ /* Concatenate the leaf partition OIDs. */
+ inhOIDs = list_concat(inhOIDs, leaf_part_oids);
+ }
+
/* Scan the inheritance set and expand it */
appinfos = NIL;
has_child = false;
--
2.11.0
On 18 August 2017 at 10:55, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2017/08/18 4:54, Robert Haas wrote:
On Thu, Aug 17, 2017 at 8:39 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
[2] had a patch with some changes to the original patch you posted. I
didn't describe those changes in my mail, since they rearranged the
comments. Those changes are not part of this patch and you haven't
commented on those changes either. If you have intentionally
excluded those changes, it's fine. In case you haven't reviewed them,
please see if they are good to be incorporated.
I took a quick look at your version but I think I like Amit's fine the
way it is, so committed that and back-patched it to v10.
Thanks for committing.
I find 0002 pretty ugly as things stand. We get a bunch of tuple maps
that we don't really need, only to turn around and free them. We get
a bunch of tuple slots that we don't need, only to turn around and
drop them. We don't really need the PartitionDispatch objects either,
except for the OIDs they contain. There's a lot of extra stuff being
computed here that is really irrelevant for this purpose. I think we
should try to clean that up somehow.
One way to do that might be to reverse the order of the remaining patches
and put the patch to refactor RelationGetPartitionDispatchInfo() first.
With that refactoring, PartitionDispatch itself has become much simpler in
that it does not contain a relcache reference to be closed eventually by
the caller, the tuple map, and the tuple table slot. Since those things
are required for tuple-routing, the refactoring makes
ExecSetupPartitionTupleRouting itself create them from the (minimal)
information returned by RelationGetPartitionDispatchInfo and ultimately
destroy when done using it. I kept the indexes field in
PartitionDispatchData though, because it's essentially free to create
while we are walking the partition tree in
RelationGetPartitionDispatchInfo() and it seems undesirable to make the
caller compute that information (indexes) by traversing the partition tree
all over again, if it doesn't otherwise have to. I am still considering
some counter-arguments raised by Amit Khandekar about this last assertion.
Thoughts?
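To make the role of the indexes field concrete, here is a minimal sketch of the encoding described in the patch comments: leaf partitions get a non-negative position in the leaf OID list, and partitioned children get the negated position of their entry in the list of partitioned tables. ToyRel, ToyDispatch, toy_build_dispatch() and toy_route() are hypothetical stand-ins on an in-memory tree, not PostgreSQL code; the "choices" array stands in for whatever partition position the bound search would pick at each level.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PARTS  4
#define TOY_MAX_TABLES 16

typedef unsigned int Oid;

/* Hypothetical stand-in for a relation; parts[] is in partition bound order. */
typedef struct ToyRel
{
	Oid			relid;
	int			nparts;		/* 0 means a leaf partition */
	const struct ToyRel *parts[TOY_MAX_PARTS];
} ToyRel;

/* Hypothetical stand-in for the per-table dispatch information. */
typedef struct ToyDispatch
{
	const ToyRel *rel;
	int			indexes[TOY_MAX_PARTS];
} ToyDispatch;

/*
 * Walk the tree breadth-first, filling pd[] (one entry per partitioned
 * table, root at 0) and leaf_oids[].  Returns the number of pd[] entries.
 */
static int
toy_build_dispatch(const ToyRel *root, ToyDispatch *pd,
				   Oid *leaf_oids, int *num_leaves)
{
	int			nparted = 1;
	int			nleaves = 0;
	int			offset = 0;	/* partitioned tables queued before this one */

	pd[0].rel = root;
	for (int t = 0; t < nparted; t++)	/* pd[] grows while we iterate */
	{
		const ToyRel *rel = pd[t].rel;
		int			k = 0;	/* partitioned children of this table */

		for (int j = 0; j < rel->nparts; j++)
		{
			const ToyRel *child = rel->parts[j];

			if (child->nparts == 0)
			{
				assert(nleaves < TOY_MAX_TABLES);
				leaf_oids[nleaves] = child->relid;
				pd[t].indexes[j] = nleaves++;	/* >= 0: leaf */
			}
			else
			{
				assert(nparted < TOY_MAX_TABLES);
				pd[t].indexes[j] = -(1 + offset + k);	/* < 0: partitioned */
				pd[nparted++].rel = child;
				k++;
			}
		}
		offset += k;
	}
	*num_leaves = nleaves;
	return nparted;
}

/* Follow the chosen partition position at each level down to a leaf index. */
static int
toy_route(const ToyDispatch *pd, const int *choices)
{
	const ToyDispatch *parent = &pd[0];

	for (int depth = 0;; depth++)
	{
		int			idx = parent->indexes[choices[depth]];

		if (idx >= 0)
			return idx;
		parent = &pd[-idx];
	}
}
```

Because both lists are filled during the same breadth-first walk, the indexes come out of the traversal at no extra cost, which is the argument made above for keeping the field.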
Another approach (which I have used in the update-partition-key patch)
is to *not* generate the OIDs beforehand and instead call a
partition_walker_next() function to traverse the tree. Each
next() call would return a ChildInfo that includes the child OID,
parent OID, etc. All users of this would see the same fixed order of
OIDs. In the update-partition-key patch, I am opening and closing each
of the children, which, of course, we need to avoid.
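A rough sketch of what such a walker could look like, assuming a toy in-memory tree in place of the relcache. Only the names partition_walker_next() and ChildInfo come from the description above; ToyRel, PartitionWalker, partition_walker_init() and the fixed-size queue are hypothetical and not taken from the update-partition-key patch.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Oid;

/* Hypothetical stand-in for a relation; parts[] is in partition bound order. */
typedef struct ToyRel
{
	Oid			relid;
	int			nparts;		/* 0 means a leaf partition */
	const struct ToyRel *parts[4];
} ToyRel;

typedef struct ChildInfo
{
	Oid			relid;
	Oid			parentid;
} ChildInfo;

#define WALKER_MAX 32

/* Iterator state: a simple FIFO queue of pending children. */
typedef struct PartitionWalker
{
	const ToyRel *queue[WALKER_MAX];
	Oid			parents[WALKER_MAX];
	int			head;
	int			tail;
} PartitionWalker;

static void
walker_enqueue_children(PartitionWalker *w, const ToyRel *rel)
{
	for (int i = 0; i < rel->nparts; i++)
	{
		assert(w->tail < WALKER_MAX);
		w->queue[w->tail] = rel->parts[i];
		w->parents[w->tail] = rel->relid;
		w->tail++;
	}
}

static void
partition_walker_init(PartitionWalker *w, const ToyRel *root)
{
	w->head = w->tail = 0;
	walker_enqueue_children(w, root);
}

/*
 * Return the next child in breadth-first, bound order.  Returns 1 and fills
 * *child if there is one, 0 when the tree is exhausted.
 */
static int
partition_walker_next(PartitionWalker *w, ChildInfo *child)
{
	const ToyRel *rel;

	if (w->head == w->tail)
		return 0;
	rel = w->queue[w->head];
	child->relid = rel->relid;
	child->parentid = w->parents[w->head];
	w->head++;
	walker_enqueue_children(w, rel);	/* no-op for leaf partitions */
	return 1;
}
```

The caller never sees a prebuilt OID list; it pulls one ChildInfo at a time, and every caller gets the same order because the order is decided by the walker alone.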
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Aug 18, 2017 at 1:17 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
0004 patch in partition-wise join patchset has code to expand
partition hierarchy. That patch is expanding inheritance hierarchy in
depth first manner. Robert commented that instead of depth first
manner, it will be better if we expand it in partitioned tables first
manner. With the latest changes in your patch-set I don't see the
reason for expanding in partitioned tables first order. Can you please
elaborate if we still need to expand in partitioned table first
manner? Maybe we should just address the expansion issue in 0004
instead of dividing it in two patches.
Let me see if I can clarify. I think there are three requirements here:
A. Amit wants to be able to prune leaf partitions before opening and
locking those relations, so that pruning can be done earlier and,
therefore, more cheaply.
B. Partition-wise join wants to expand the inheritance hierarchy a
level at a time instead of all at once, ending up with rte->inh = true
entries for intermediate partitioned tables.
C. Partition-wise join (and lots of other things; see numerous
mentions of EIBO in
http://rhaas.blogspot.com/2017/08/plans-for-partitioning-in-v11.html)
want to expand in bound order.
Obviously, bound-order and partitioned-tables-first are incompatible
orderings, but there's no actual conflict because the first one has to
do with the order of *expansion* and the second one has to do with the
order of *locking*. So in the end game I think
expand_inherited_rtentry looks approximately like this:
1. Calling find_all_inheritors with a new only-lock-the-partitions
flag. This should result in locking all partitioned tables in the
inheritance hierarchy in breadth-first, low-OID-first order. (When
the only-lock-the-partitions isn't specified, all partitioned tables
should still be locked before any unpartitioned tables, so that the
locking order in that case is consistent with what we do here.)
2. Iterate over the partitioned tables identified in step 1 in the
order in which they were returned. For each one:
- Decide which children can be pruned.
- Lock the unpruned, non-partitioned children in low-OID-first order.
3. Make another pass over the inheritance hierarchy, starting at the
root. Traverse the whole hierarchy breadth-first, in *bound* order.
Add RTEs and AppendRelInfos as we go -- these will have rte->inh =
true for partitioned tables and rte->inh = false for leaf partitions.
Whether we should try to go straight to the end state here or do this
via a series of incremental changes, I'm not entirely sure right now.
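To illustrate why the two orderings do not conflict, here is a small sketch that computes both over the same toy tree: a lock order (partitioned tables first, low-OID-first, then each table's leaf children low-OID-first) and an expansion order (breadth-first in bound order). Everything here is a hypothetical model; ToyRel, toy_lock_order() and toy_expand_order() are not PostgreSQL functions, and pruning from step 2 is left out.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Oid;

/* Hypothetical stand-in for a relation; parts[] is in partition bound order. */
typedef struct ToyRel
{
	Oid			relid;
	int			nparts;		/* 0 means a leaf partition */
	const struct ToyRel *parts[4];
} ToyRel;

#define MAX_RELS 32

/*
 * Collect rel's children that are (want_parted != 0) or are not
 * (want_parted == 0) partitioned, sorted by ascending OID.
 */
static int
children_by_oid(const ToyRel *rel, int want_parted, const ToyRel **out)
{
	int			n = 0;

	for (int i = 0; i < rel->nparts; i++)
	{
		const ToyRel *c = rel->parts[i];
		int			j;

		if ((c->nparts > 0) != (want_parted != 0))
			continue;
		/* insertion sort keeps out[] ordered by OID */
		for (j = n; j > 0 && out[j - 1]->relid > c->relid; j--)
			out[j] = out[j - 1];
		out[j] = c;
		n++;
	}
	return n;
}

/* Steps 1 and 2: all partitioned tables first, then the leaves. */
static int
toy_lock_order(const ToyRel *root, Oid *oids)
{
	const ToyRel *parted[MAX_RELS];
	const ToyRel *tmp[4];
	int			nparted = 0;
	int			n = 0;

	parted[nparted++] = root;
	for (int i = 0; i < nparted; i++)	/* queue grows while we iterate */
	{
		int			k = children_by_oid(parted[i], 1, tmp);

		assert(nparted + k <= MAX_RELS);
		for (int j = 0; j < k; j++)
			parted[nparted++] = tmp[j];
	}
	for (int i = 0; i < nparted; i++)
		oids[n++] = parted[i]->relid;
	for (int i = 0; i < nparted; i++)
	{
		int			k = children_by_oid(parted[i], 0, tmp);

		for (int j = 0; j < k; j++)
			oids[n++] = tmp[j]->relid;
	}
	return n;
}

/* Step 3: breadth-first traversal following partition bound order. */
static int
toy_expand_order(const ToyRel *root, Oid *oids)
{
	const ToyRel *queue[MAX_RELS];
	int			head = 0;
	int			tail = 0;
	int			n = 0;

	queue[tail++] = root;
	while (head < tail)
	{
		const ToyRel *rel = queue[head++];

		oids[n++] = rel->relid;
		for (int i = 0; i < rel->nparts; i++)
		{
			assert(tail < MAX_RELS);
			queue[tail++] = rel->parts[i];
		}
	}
	return n;
}
```

Both functions visit the same set of relations; only the sequence differs, so a deadlock-safe lock order and a bound-ordered list of RTEs can coexist.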
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Aug 19, 2017 at 1:21 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Aug 18, 2017 at 1:17 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
0004 patch in partition-wise join patchset has code to expand
partition hierarchy. That patch is expanding inheritance hierarchy in
depth first manner. Robert commented that instead of depth first
manner, it will be better if we expand it in partitioned tables first
manner. With the latest changes in your patch-set I don't see the
reason for expanding in partitioned tables first order. Can you please
elaborate if we still need to expand in partitioned table first
manner? Maybe we should just address the expansion issue in 0004
instead of dividing it in two patches.
Let me see if I can clarify. I think there are three requirements here:
A. Amit wants to be able to prune leaf partitions before opening and
locking those relations, so that pruning can be done earlier and,
therefore, more cheaply.
We could actually prune partitioned tables, thus pruning a whole
partition subtree. Do we then want to lock those partitioned tables but
not the leaves in that subtree?
If there's already some discussion answering this question, please
point me to the same. Sorry for not paying attention to it.
B. Partition-wise join wants to expand the inheritance hierarchy a
level at a time instead of all at once, ending up with rte->inh = true
entries for intermediate partitioned tables.
And create AppendRelInfos which pair children with their partitioned
parent rather than the root.
C. Partition-wise join (and lots of other things; see numerous
mentions of EIBO in
http://rhaas.blogspot.com/2017/08/plans-for-partitioning-in-v11.html)
want to expand in bound order.
Obviously, bound-order and partitioned-tables-first are incompatible
orderings, but there's no actual conflict because the first one has to
do with the order of *expansion* and the second one has to do with the
order of *locking*.
right. Thanks for making it clear.
So in the end game I think
expand_inherited_rtentry looks approximately like this:
1. Calling find_all_inheritors with a new only-lock-the-partitions
flag. This should result in locking all partitioned tables in the
inheritance hierarchy in breadth-first, low-OID-first order. (When
the only-lock-the-partitions isn't specified, all partitioned tables
should still be locked before any unpartitioned tables, so that the
locking order in that case is consistent with what we do here.)
I am confused. When "only-lock-the-partitions" is true, do we expect
intermediate partitioned tables to be locked? Why then "only" in the
flag?
2. Iterate over the partitioned tables identified in step 1 in the
order in which they were returned. For each one:
- Decide which children can be pruned.
- Lock the unpruned, non-partitioned children in low-OID-first order.
3. Make another pass over the inheritance hierarchy, starting at the
root. Traverse the whole hierarchy breadth-first, in *bound* order.
Add RTEs and AppendRelInfos as we go -- these will have rte->inh =
true for partitioned tables and rte->inh = false for leaf partitions.
These two seem to be based on the assumption that we have to lock all
the partitioned tables even if they can be pruned.
For regular inheritance there is only a single parent, so traversing
the list returned by find_all_inheritors suffices. For partitioned
hierarchy, we need to know the parent of every child, which is not
part of the find_all_inheritors() output list. Even if it returns only
the partitioned children, they themselves may have a parent different
from the root partition. So, we have to discard the output of
find_all_inheritors() for partitioned hierarchy and traverse the
children as per their order in the oids array in PartitionDesc. Maybe
it's better to separate the guts of expand_inherited_rtentry(), which
create AppendRelInfos, RTEs and rowmarks for the children, into a
separate routine. Use that routine in two different functions,
expand_inherited_rtentry() and expand_partitioned_rtentry(), for
regular inheritance and partitioned inheritance respectively. The functions
will use two different traversal methods appropriate for traversing
the children in either case.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Hi Amit,
On 2017/08/17 21:18, Amit Khandekar wrote:
Anyways, some more comments :
In ExecSetupPartitionTupleRouting(), not sure why ptrinfos array is an
array of pointers. Why can't it be an array of
PartitionTupleRoutingInfo structure rather than pointer to that
structure ?
AFAIK, assigning a pointer is less expensive than assigning a struct, and we
end up assigning members of that array to a local variable quite a lot, in
get_partition_for_tuple() for example. Perhaps we could
avoid those assignments and implement it the way you suggest.
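For what it's worth, the two layouts can be compared with a minimal sketch like the one below. ToyRoutingInfo and both helper functions are hypothetical and only model the shape of the ptrinfos array; the point is that with an array of structs, a local variable can still hold a pointer to the element, so no struct copy is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for PartitionTupleRoutingInfo. */
typedef struct ToyRoutingInfo
{
	unsigned int relid;
	int			nparts;
} ToyRoutingInfo;

/* Layout in the patch: an array of pointers to separately allocated structs. */
static ToyRoutingInfo *
pick_from_pointer_array(ToyRoutingInfo **infos, int i)
{
	return infos[i];		/* copies one pointer */
}

/*
 * Suggested layout: one contiguous array of structs.  Taking the address of
 * the element also yields a pointer without copying the struct itself.
 */
static ToyRoutingInfo *
pick_from_struct_array(ToyRoutingInfo *infos, int i)
{
	return &infos[i];
}
```

In both cases the loop in a function like get_partition_for_tuple() would keep a ToyRoutingInfo pointer as its "parent" variable; the difference is mainly one allocation versus many.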
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
+ * Close all the leaf partitions and their indices.
*
Above comment needs to be shifted a bit down to the subsequent "for"
loop where it's actually applicable.
That's right, done.
* node->mt_partition_dispatch_info[0] corresponds to the root partitioned
* table, for which we didn't create tupslot.
Above : node->mt_partition_dispatch_info[0] => node->mt_ptrinfos[0]
Oops, fixed.
/*
* XXX- do we need a pinning mechanism for partition descriptors
* so that their references can be managed independently of
* the parent relcache entry? Like PinPartitionDesc(partdesc)?
*/
pd->partdesc = partdesc;
Any idea if the above can be handled? I am not too sure.
A similar mechanism exists for TupleDesc ref-counting (see the usage of
PinTupleDesc and ReleaseTupleDesc across the backend code.) I too am
currently unsure if such an elaborate mechanism is actually *necessary*
for rd_partdesc.
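As a thought experiment only, a pin/release scheme for partition descriptors could look like the toy model below. ToyPartitionDesc, ToyPinPartitionDesc() and ToyReleasePartitionDesc() are hypothetical names; this is plain reference counting and does not reproduce what PinTupleDesc and ReleaseTupleDesc actually do in the backend (for instance, any resource-owner bookkeeping is omitted).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted descriptor. */
typedef struct ToyPartitionDesc
{
	int			nparts;
	int			refcount;
} ToyPartitionDesc;

/* Counts descriptors that have actually been freed, for illustration. */
static int	toy_descs_freed = 0;

/* Create a descriptor holding one reference, owned by the "relcache entry". */
static ToyPartitionDesc *
toy_partdesc_create(int nparts)
{
	ToyPartitionDesc *desc = malloc(sizeof(ToyPartitionDesc));

	assert(desc != NULL);
	desc->nparts = nparts;
	desc->refcount = 1;
	return desc;
}

/* Take an additional reference so the descriptor outlives its owner. */
static void
ToyPinPartitionDesc(ToyPartitionDesc *desc)
{
	assert(desc->refcount > 0);
	desc->refcount++;
}

/* Drop a reference; the descriptor is freed when the last one goes away. */
static void
ToyReleasePartitionDesc(ToyPartitionDesc *desc)
{
	assert(desc->refcount > 0);
	if (--desc->refcount == 0)
	{
		free(desc);
		toy_descs_freed++;
	}
}
```

With something like this, a caller holding a pin could keep using the descriptor even after the owning entry drops its own reference, which is the independence the XXX comment asks about. Whether that machinery is worth having is exactly the open question.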
Attached updated patches.
Thanks,
Amit
Attachments:
0001-Decouple-RelationGetPartitionDispatchInfo-from-execu.patch (text/plain)
From 0cf8ab795fd3a8db462e8c692cfaa73f19e71ed6 Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Mon, 24 Jul 2017 18:59:57 +0900
Subject: [PATCH 1/2] Decouple RelationGetPartitionDispatchInfo() from executor
Currently it and the structure it generates viz. PartitionDispatch
objects are too coupled with the executor's tuple-routing code. In
particular, it's pretty undesirable that it makes it the responsibility
of the caller to release some resources, such as relcache references
and tuple table slots. That makes it harder to use in places other
than where it's currently being used.
After this refactoring, ExecSetupPartitionTupleRouting() now needs to
do some of the work that was previously done in
RelationGetPartitionDispatchInfo() and expand_inherited_rtentry() no
longer needs to do some things that it used to.
---
src/backend/catalog/partition.c | 309 +++++++++++++++++----------------
src/backend/commands/copy.c | 37 ++--
src/backend/executor/execMain.c | 145 ++++++++++++++--
src/backend/executor/nodeModifyTable.c | 35 ++--
src/include/catalog/partition.h | 52 +++---
src/include/executor/executor.h | 4 +-
src/include/nodes/execnodes.h | 53 +++++-
7 files changed, 398 insertions(+), 237 deletions(-)
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 96a64ce6b2..7618e4cb31 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -105,6 +105,24 @@ typedef struct PartitionRangeBound
bool lower; /* this is the lower (vs upper) bound */
} PartitionRangeBound;
+/*-----------------------
+ * PartitionDispatchData - information of partitions of one partitioned table
+ * in a partition tree
+ *
+ * partkey Partition key of the table
+ * partdesc Partition descriptor of the table
+ * indexes Array with partdesc->nparts members (for details on what the
+ * individual value represents, see the comments in
+ * RelationGetPartitionDispatchInfo())
+ *-----------------------
+ */
+typedef struct PartitionDispatchData
+{
+ PartitionKey partkey; /* Points into the table's relcache entry */
+ PartitionDesc partdesc; /* Ditto */
+ int *indexes;
+} PartitionDispatchData;
+
static int32 qsort_partition_list_value_cmp(const void *a, const void *b,
void *arg);
static int32 qsort_partition_rbound_cmp(const void *a, const void *b,
@@ -981,181 +999,165 @@ get_partition_qual_relid(Oid relid)
}
/*
- * Append OIDs of rel's partitions to the list 'partoids' and for each OID,
- * append pointer rel to the list 'parents'.
- */
-#define APPEND_REL_PARTITION_OIDS(rel, partoids, parents) \
- do\
- {\
- int i;\
- for (i = 0; i < (rel)->rd_partdesc->nparts; i++)\
- {\
- (partoids) = lappend_oid((partoids), (rel)->rd_partdesc->oids[i]);\
- (parents) = lappend((parents), (rel));\
- }\
- } while(0)
-
-/*
* RelationGetPartitionDispatchInfo
- * Returns information necessary to route tuples down a partition tree
+ * Returns necessary information for each partition in the partition
+ * tree rooted at rel
*
- * The number of elements in the returned array (that is, the number of
- * PartitionDispatch objects for the partitioned tables in the partition tree)
- * is returned in *num_parted and a list of the OIDs of all the leaf
- * partitions of rel is returned in *leaf_part_oids.
+ * Information returned includes the following: *ptinfos contains a list of
+ * PartitionedTableInfo objects, one for each partitioned table (with at least
+ * one member, that is, one for the root partitioned table), *leaf_part_oids
+ * contains a list of the OIDs of all the leaf partitions.
*
- * All the relations in the partition tree (including 'rel') must have been
- * locked (using at least the AccessShareLock) by the caller.
+ * We require that the caller has locked at least the partitioned tables in the
+ * partition tree (including 'rel') using at least the AccessShareLock,
+ * because we need to look at their relcache entries to get PartitionKey and
+ * PartitionDesc.
*/
-PartitionDispatch *
+void
RelationGetPartitionDispatchInfo(Relation rel,
- int *num_parted, List **leaf_part_oids)
+ List **ptinfos, List **leaf_part_oids)
{
- PartitionDispatchData **pd;
- List *all_parts = NIL,
- *all_parents = NIL,
- *parted_rels,
- *parted_rel_parents;
+ List *all_parts,
+ *all_parents;
ListCell *lc1,
*lc2;
int i,
- k,
offset;
/*
* We rely on the relcache to traverse the partition tree to build both
- * the leaf partition OIDs list and the array of PartitionDispatch objects
- * for the partitioned tables in the tree. That means every partitioned
- * table in the tree must be locked, which is fine since we require the
- * caller to lock all the partitions anyway.
+ * the leaf partition OIDs list and the list of PartitionedTableInfo
+ * objects for partitioned tables. That means every partitioned table in
+ * the tree must be locked, which is fine since the callers must have done
+ * that already.
*
* For every partitioned table in the tree, starting with the root
* partitioned table, add its relcache entry to parted_rels, while also
* queuing its partitions (in the order in which they appear in the
* partition descriptor) to be looked at later in the same loop. This is
* a bit tricky but works because the foreach() macro doesn't fetch the
- * next list element until the bottom of the loop.
+ * next list element until the bottom of the loop. Non-partitioned tables
+ * are simply added to the leaf partitions list.
*/
- *num_parted = 1;
- parted_rels = list_make1(rel);
- /* Root partitioned table has no parent, so NULL for parent */
- parted_rel_parents = list_make1(NULL);
- APPEND_REL_PARTITION_OIDS(rel, all_parts, all_parents);
+ i = offset = 0;
+ *ptinfos = *leaf_part_oids = NIL;
+
+ /* Start with the root table. */
+ all_parts = list_make1_oid(RelationGetRelid(rel));
+ all_parents = list_make1_oid(InvalidOid);
forboth(lc1, all_parts, lc2, all_parents)
{
- Oid partrelid = lfirst_oid(lc1);
- Relation parent = lfirst(lc2);
+ Oid partrelid = lfirst_oid(lc1);
+ Oid parentrelid = lfirst_oid(lc2);
if (get_rel_relkind(partrelid) == RELKIND_PARTITIONED_TABLE)
{
- /*
- * Already locked by the caller. Note that it is the
- * responsibility of the caller to close the below relcache entry,
- * once done using the information being collected here (for
- * example, in ExecEndModifyTable).
- */
- Relation partrel = heap_open(partrelid, NoLock);
+ int j,
+ k;
+ Relation partrel;
+ PartitionKey partkey;
+ PartitionDesc partdesc;
+ PartitionedTableInfo *ptinfo;
+ PartitionDispatch pd;
+
+ if (partrelid != RelationGetRelid(rel))
+ partrel = heap_open(partrelid, NoLock);
+ else
+ partrel = rel;
- (*num_parted)++;
- parted_rels = lappend(parted_rels, partrel);
- parted_rel_parents = lappend(parted_rel_parents, parent);
- APPEND_REL_PARTITION_OIDS(partrel, all_parts, all_parents);
- }
- }
+ partkey = RelationGetPartitionKey(partrel);
+ partdesc = RelationGetPartitionDesc(partrel);
+
+ ptinfo = (PartitionedTableInfo *)
+ palloc0(sizeof(PartitionedTableInfo));
+ ptinfo->relid = partrelid;
+ ptinfo->parentid = parentrelid;
+
+ ptinfo->pd = pd = (PartitionDispatchData *)
+ palloc0(sizeof(PartitionDispatchData));
+ pd->partkey = partkey;
- /*
- * We want to create two arrays - one for leaf partitions and another for
- * partitioned tables (including the root table and internal partitions).
- * While we only create the latter here, leaf partition array of suitable
- * objects (such as, ResultRelInfo) is created by the caller using the
- * list of OIDs we return. Indexes into these arrays get assigned in a
- * breadth-first manner, whereby partitions of any given level are placed
- * consecutively in the respective arrays.
- */
- pd = (PartitionDispatchData **) palloc(*num_parted *
- sizeof(PartitionDispatchData *));
- *leaf_part_oids = NIL;
- i = k = offset = 0;
- forboth(lc1, parted_rels, lc2, parted_rel_parents)
- {
- Relation partrel = lfirst(lc1);
- Relation parent = lfirst(lc2);
- PartitionKey partkey = RelationGetPartitionKey(partrel);
- TupleDesc tupdesc = RelationGetDescr(partrel);
- PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
- int j,
- m;
-
- pd[i] = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
- pd[i]->reldesc = partrel;
- pd[i]->key = partkey;
- pd[i]->keystate = NIL;
- pd[i]->partdesc = partdesc;
- if (parent != NULL)
- {
/*
- * For every partitioned table other than root, we must store a
- * tuple table slot initialized with its tuple descriptor and a
- * tuple conversion map to convert a tuple from its parent's
- * rowtype to its own. That is to make sure that we are looking at
- * the correct row using the correct tuple descriptor when
- * computing its partition key for tuple routing.
+ * XXX- do we need a pinning mechanism for partition descriptors
+ * so that their references can be managed independently of
+ * the parent relcache entry? Like PinPartitionDesc(partdesc)?
*/
- pd[i]->tupslot = MakeSingleTupleTableSlot(tupdesc);
- pd[i]->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
- tupdesc,
- gettext_noop("could not convert row type"));
- }
- else
- {
- /* Not required for the root partitioned table */
- pd[i]->tupslot = NULL;
- pd[i]->tupmap = NULL;
- }
- pd[i]->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
+ pd->partdesc = partdesc;
- /*
- * Indexes corresponding to the internal partitions are multiplied by
- * -1 to distinguish them from those of leaf partitions. Encountering
- * an index >= 0 means we found a leaf partition, which is immediately
- * returned as the partition we are looking for. A negative index
- * means we found a partitioned table, whose PartitionDispatch object
- * is located at the above index multiplied back by -1. Using the
- * PartitionDispatch object, search is continued further down the
- * partition tree.
- */
- m = 0;
- for (j = 0; j < partdesc->nparts; j++)
- {
- Oid partrelid = partdesc->oids[j];
+ /*
+ * The values contained in the following array correspond to
+ * indexes of this table's partitions in the global sequence of
+ * all the partitions contained in the partition tree rooted at
+ * rel, traversed in a breadth-first manner. The values should be
+ * such that we will be able to distinguish the leaf partitions
+ * from the non-leaf partitions, because they are returned to
+ * the caller in separate structures from where they will be
+ * accessed. The way that's done is described below:
+ *
+ * Leaf partition OIDs are put into the global leaf_part_oids list,
+ * and for each one, the value stored is its ordinal position in
+ * the list minus 1.
+ *
+ * PartitionedTableInfo objects corresponding to partitions that
+ * are partitioned tables are put into the global ptinfos[] list,
+ * and for each one, the value stored is its ordinal position in
+ * the list multiplied by -1.
+ *
+ * So while looking at the values in the indexes array, if one
+ * gets zero or a positive value, then it's a leaf partition;
+ * otherwise, it's a partitioned table.
+ */
+ pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
- if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
- {
- *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
- pd[i]->indexes[j] = k++;
- }
- else
+ k = 0;
+ for (j = 0; j < partdesc->nparts; j++)
{
+ Oid partrelid = partdesc->oids[j];
+
/*
- * offset denotes the number of partitioned tables of upper
- * levels including those of the current level. Any partition
- * of this table must belong to the next level and hence will
- * be placed after the last partitioned table of this level.
+ * Queue this partition so that it will be processed later
+ * by the outer loop.
*/
- pd[i]->indexes[j] = -(1 + offset + m);
- m++;
+ all_parts = lappend_oid(all_parts, partrelid);
+ all_parents = lappend_oid(all_parents,
+ RelationGetRelid(partrel));
+
+ if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
+ {
+ *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
+ pd->indexes[j] = i++;
+ }
+ else
+ {
+ /*
+ * offset denotes the number of partitioned tables that
+ * we have already processed. k counts the number of
+ * partitions of this table that were found to be
+ * partitioned tables.
+ */
+ pd->indexes[j] = -(1 + offset + k);
+ k++;
+ }
}
- }
- i++;
- /*
- * This counts the number of partitioned tables at upper levels
- * including those of the current level.
- */
- offset += m;
+ offset += k;
+
+ /*
+ * Release the relation descriptor. Lock that we have on the
+ * table will keep the PartitionDesc that is pointing into
+ * RelationData intact, a pointer to which we hope to keep
+ * through this transaction's commit.
+ * (XXX - how true is that?)
+ */
+ if (partrel != rel)
+ heap_close(partrel, NoLock);
+
+ *ptinfos = lappend(*ptinfos, ptinfo);
+ }
}
- return pd;
+ Assert(i == list_length(*leaf_part_oids));
+ Assert((offset + 1) == list_length(*ptinfos));
}
/* Module-local functions */
@@ -1872,7 +1874,7 @@ generate_partition_qual(Relation rel)
* ----------------
*/
void
-FormPartitionKeyDatum(PartitionDispatch pd,
+FormPartitionKeyDatum(PartitionKeyInfo *keyinfo,
TupleTableSlot *slot,
EState *estate,
Datum *values,
@@ -1881,20 +1883,21 @@ FormPartitionKeyDatum(PartitionDispatch pd,
ListCell *partexpr_item;
int i;
- if (pd->key->partexprs != NIL && pd->keystate == NIL)
+ if (keyinfo->key->partexprs != NIL && keyinfo->keystate == NIL)
{
/* Check caller has set up context correctly */
Assert(estate != NULL &&
GetPerTupleExprContext(estate)->ecxt_scantuple == slot);
/* First time through, set up expression evaluation state */
- pd->keystate = ExecPrepareExprList(pd->key->partexprs, estate);
+ keyinfo->keystate = ExecPrepareExprList(keyinfo->key->partexprs,
+ estate);
}
- partexpr_item = list_head(pd->keystate);
- for (i = 0; i < pd->key->partnatts; i++)
+ partexpr_item = list_head(keyinfo->keystate);
+ for (i = 0; i < keyinfo->key->partnatts; i++)
{
- AttrNumber keycol = pd->key->partattrs[i];
+ AttrNumber keycol = keyinfo->key->partattrs[i];
Datum datum;
bool isNull;
@@ -1931,13 +1934,13 @@ FormPartitionKeyDatum(PartitionDispatch pd,
* the latter case.
*/
int
-get_partition_for_tuple(PartitionDispatch *pd,
+get_partition_for_tuple(PartitionTupleRoutingInfo **ptrinfos,
TupleTableSlot *slot,
EState *estate,
- PartitionDispatchData **failed_at,
+ PartitionTupleRoutingInfo **failed_at,
TupleTableSlot **failed_slot)
{
- PartitionDispatch parent;
+ PartitionTupleRoutingInfo *parent;
Datum values[PARTITION_MAX_KEYS];
bool isnull[PARTITION_MAX_KEYS];
int cur_offset,
@@ -1948,11 +1951,11 @@ get_partition_for_tuple(PartitionDispatch *pd,
TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
/* start with the root partitioned table */
- parent = pd[0];
+ parent = ptrinfos[0];
while (true)
{
- PartitionKey key = parent->key;
- PartitionDesc partdesc = parent->partdesc;
+ PartitionKey key = parent->pd->partkey;
+ PartitionDesc partdesc = parent->pd->partdesc;
TupleTableSlot *myslot = parent->tupslot;
TupleConversionMap *map = parent->tupmap;
@@ -1984,7 +1987,7 @@ get_partition_for_tuple(PartitionDispatch *pd,
* So update ecxt_scantuple accordingly.
*/
ecxt->ecxt_scantuple = slot;
- FormPartitionKeyDatum(parent, slot, estate, values, isnull);
+ FormPartitionKeyDatum(parent->keyinfo, slot, estate, values, isnull);
if (key->strategy == PARTITION_STRATEGY_RANGE)
{
@@ -2055,13 +2058,13 @@ get_partition_for_tuple(PartitionDispatch *pd,
*failed_slot = slot;
break;
}
- else if (parent->indexes[cur_index] >= 0)
+ else if (parent->pd->indexes[cur_index] >= 0)
{
- result = parent->indexes[cur_index];
+ result = parent->pd->indexes[cur_index];
break;
}
else
- parent = pd[-parent->indexes[cur_index]];
+ parent = ptrinfos[-parent->pd->indexes[cur_index]];
}
error_exit:
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfa3f059c2..b0c596345b 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -165,8 +165,8 @@ typedef struct CopyStateData
bool volatile_defexprs; /* is any of defexprs volatile? */
List *range_table;
- PartitionDispatch *partition_dispatch_info;
- int num_dispatch; /* Number of entries in the above array */
+ PartitionTupleRoutingInfo **ptrinfos;
+ int num_parted; /* Number of entries in the above array */
int num_partitions; /* Number of members in the following arrays */
ResultRelInfo *partitions; /* Per partition result relation */
TupleConversionMap **partition_tupconv_maps;
@@ -2445,7 +2445,7 @@ CopyFrom(CopyState cstate)
*/
if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- PartitionDispatch *partition_dispatch_info;
+ PartitionTupleRoutingInfo **ptrinfos;
ResultRelInfo *partitions;
TupleConversionMap **partition_tupconv_maps;
TupleTableSlot *partition_tuple_slot;
@@ -2455,13 +2455,13 @@ CopyFrom(CopyState cstate)
ExecSetupPartitionTupleRouting(cstate->rel,
1,
estate,
- &partition_dispatch_info,
+ &ptrinfos,
&partitions,
&partition_tupconv_maps,
&partition_tuple_slot,
&num_parted, &num_partitions);
- cstate->partition_dispatch_info = partition_dispatch_info;
- cstate->num_dispatch = num_parted;
+ cstate->ptrinfos = ptrinfos;
+ cstate->num_parted = num_parted;
cstate->partitions = partitions;
cstate->num_partitions = num_partitions;
cstate->partition_tupconv_maps = partition_tupconv_maps;
@@ -2502,7 +2502,7 @@ CopyFrom(CopyState cstate)
if ((resultRelInfo->ri_TrigDesc != NULL &&
(resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
- cstate->partition_dispatch_info != NULL ||
+ cstate->ptrinfos != NULL ||
cstate->volatile_defexprs)
{
useHeapMultiInsert = false;
@@ -2580,7 +2580,7 @@ CopyFrom(CopyState cstate)
ExecStoreTuple(tuple, slot, InvalidBuffer, false);
/* Determine the partition to heap_insert the tuple into */
- if (cstate->partition_dispatch_info)
+ if (cstate->ptrinfos)
{
int leaf_part_index;
TupleConversionMap *map;
@@ -2594,7 +2594,7 @@ CopyFrom(CopyState cstate)
* partition, respectively.
*/
leaf_part_index = ExecFindPartition(resultRelInfo,
- cstate->partition_dispatch_info,
+ cstate->ptrinfos,
slot,
estate);
Assert(leaf_part_index >= 0 &&
@@ -2826,24 +2826,23 @@ CopyFrom(CopyState cstate)
ExecCloseIndices(resultRelInfo);
- /* Close all the partitioned tables, leaf partitions, and their indices */
- if (cstate->partition_dispatch_info)
+ /* Release some resources that we acquired for tuple-routing. */
+ if (cstate->ptrinfos)
{
int i;
/*
- * Remember cstate->partition_dispatch_info[0] corresponds to the root
- * partitioned table, which we must not try to close, because it is
- * the main target table of COPY that will be closed eventually by
- * DoCopy(). Also, tupslot is NULL for the root partitioned table.
+ * cstate->ptrinfos[0] corresponds to the root partitioned table, for
+ * which we didn't create tupslot.
*/
- for (i = 1; i < cstate->num_dispatch; i++)
+ for (i = 1; i < cstate->num_parted; i++)
{
- PartitionDispatch pd = cstate->partition_dispatch_info[i];
+ PartitionTupleRoutingInfo *ptrinfo = cstate->ptrinfos[i];
- heap_close(pd->reldesc, NoLock);
- ExecDropSingleTupleTableSlot(pd->tupslot);
+ ExecDropSingleTupleTableSlot(ptrinfo->tupslot);
}
+
+ /* Close all the leaf partitions and their indices */
for (i = 0; i < cstate->num_partitions; i++)
{
ResultRelInfo *resultRelInfo = cstate->partitions + i;
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 2946a0edee..a03188aba3 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -3236,8 +3236,8 @@ EvalPlanQualEnd(EPQState *epqstate)
* tuple routing for partitioned tables
*
* Output arguments:
- * 'pd' receives an array of PartitionDispatch objects with one entry for
- * every partitioned table in the partition tree
+ * 'ptrinfos' receives an array of PartitionTupleRoutingInfo objects with one
+ * entry for each partitioned table in the partition tree
* 'partitions' receives an array of ResultRelInfo objects with one entry for
* every leaf partition in the partition tree
* 'tup_conv_maps' receives an array of TupleConversionMap objects with one
@@ -3260,7 +3260,7 @@ void
ExecSetupPartitionTupleRouting(Relation rel,
Index resultRTindex,
EState *estate,
- PartitionDispatch **pd,
+ PartitionTupleRoutingInfo ***ptrinfos,
ResultRelInfo **partitions,
TupleConversionMap ***tup_conv_maps,
TupleTableSlot **partition_tuple_slot,
@@ -3268,16 +3268,135 @@ ExecSetupPartitionTupleRouting(Relation rel,
{
TupleDesc tupDesc = RelationGetDescr(rel);
List *leaf_parts;
+ List *ptinfos = NIL;
ListCell *cell;
int i;
ResultRelInfo *leaf_part_rri;
+ Relation parent;
/*
* Get the information about the partition tree after locking all the
* partitions.
*/
(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
- *pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
+ RelationGetPartitionDispatchInfo(rel, &ptinfos, &leaf_parts);
+
+ /*
+ * The ptinfos list contains PartitionedTableInfo objects for all the
+ * partitioned tables in the partition tree. Using the information
+ * therein, we construct an array of PartitionTupleRoutingInfo objects
+ * to be used during tuple-routing.
+ */
+ *num_parted = list_length(ptinfos);
+ *ptrinfos = (PartitionTupleRoutingInfo **) palloc0(*num_parted *
+ sizeof(PartitionTupleRoutingInfo *));
+ /*
+ * Free the ptinfos List structure itself as we go through (open-coded
+ * list_free).
+ */
+ i = 0;
+ cell = list_head(ptinfos);
+ parent = NULL;
+ while (cell)
+ {
+ ListCell *tmp = cell;
+ PartitionedTableInfo *ptinfo = lfirst(tmp),
+ *next_ptinfo = NULL;
+ Relation partrel;
+ PartitionTupleRoutingInfo *ptrinfo;
+
+ if (lnext(tmp))
+ next_ptinfo = lfirst(lnext(tmp));
+
+ /* As mentioned above, the partitioned tables have been locked. */
+ if (ptinfo->relid != RelationGetRelid(rel))
+ partrel = heap_open(ptinfo->relid, NoLock);
+ else
+ partrel = rel;
+
+ ptrinfo = (PartitionTupleRoutingInfo *)
+ palloc0(sizeof(PartitionTupleRoutingInfo));
+ ptrinfo->relid = ptinfo->relid;
+
+ /* Stash a reference to this PartitionDispatch. */
+ ptrinfo->pd = ptinfo->pd;
+
+ /* State for extracting partition key from tuples will go here. */
+ ptrinfo->keyinfo = (PartitionKeyInfo *)
+ palloc0(sizeof(PartitionKeyInfo));
+ ptrinfo->keyinfo->key = RelationGetPartitionKey(partrel);
+ ptrinfo->keyinfo->keystate = NIL;
+
+ /*
+ * For every partitioned table other than root, we must store a tuple
+ * table slot initialized with its tuple descriptor and a tuple
+ * conversion map to convert a tuple from its parent's rowtype to its
+ * own. That is to make sure that we are looking at the correct row
+ * using the correct tuple descriptor when computing its partition key
+ * for tuple routing.
+ */
+ if (ptinfo->parentid != InvalidOid)
+ {
+ TupleDesc tupdesc = RelationGetDescr(partrel);
+
+ /* Open the parent relation descriptor if not already done. */
+ if (ptinfo->parentid == RelationGetRelid(rel))
+ {
+ parent = rel;
+ }
+ else if (parent == NULL)
+ {
+ /* Locked by RelationGetPartitionDispatchInfo(). */
+ parent = heap_open(ptinfo->parentid, NoLock);
+ }
+
+ ptrinfo->tupslot = MakeSingleTupleTableSlot(tupdesc);
+ ptrinfo->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
+ tupdesc,
+ gettext_noop("could not convert row type"));
+
+ /*
+ * Close the parent descriptor, if the next partitioned table in
+ * the list is not a sibling, because it will have a different
+ * parent if so.
+ */
+ if (parent != NULL && parent != rel &&
+ next_ptinfo != NULL &&
+ next_ptinfo->parentid != ptinfo->parentid)
+ {
+ heap_close(parent, NoLock);
+ parent = NULL;
+ }
+
+ /*
+ * Release the relation descriptor. Lock that we have on the
+ * table will keep the PartitionDesc that is pointing into
+ * RelationData intact, a pointer to which we hope to keep
+ * through this transaction's commit.
+ * (XXX - how true is that?)
+ */
+ if (partrel != rel)
+ heap_close(partrel, NoLock);
+ }
+ else
+ {
+ /* Not required for the root partitioned table */
+ ptrinfo->tupslot = NULL;
+ ptrinfo->tupmap = NULL;
+ }
+
+ (*ptrinfos)[i++] = ptrinfo;
+
+ /* Free the ListCell. */
+ cell = lnext(cell);
+ pfree(tmp);
+ }
+
+ /* Free the List itself. */
+ if (ptinfos)
+ pfree(ptinfos);
+
+ /* For leaf partitions, we build ResultRelInfos and TupleConversionMaps. */
*num_partitions = list_length(leaf_parts);
*partitions = (ResultRelInfo *) palloc(*num_partitions *
sizeof(ResultRelInfo));
@@ -3304,7 +3423,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
* Note that each of the relations in *partitions are eventually
* closed by the caller.
*/
- partrel = heap_open(lfirst_oid(cell), NoLock);
+ partrel = heap_open(lfirst_oid(cell), RowExclusiveLock);
part_tupdesc = RelationGetDescr(partrel);
/*
@@ -3317,7 +3436,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
* partition from the parent's type to the partition's.
*/
(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
- gettext_noop("could not convert row type"));
+ gettext_noop("could not convert row type"));
InitResultRelInfo(leaf_part_rri,
partrel,
@@ -3354,11 +3473,13 @@ ExecSetupPartitionTupleRouting(Relation rel,
* by get_partition_for_tuple() unchanged.
*/
int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
- TupleTableSlot *slot, EState *estate)
+ExecFindPartition(ResultRelInfo *resultRelInfo,
+ PartitionTupleRoutingInfo **ptrinfos,
+ TupleTableSlot *slot,
+ EState *estate)
{
int result;
- PartitionDispatchData *failed_at;
+ PartitionTupleRoutingInfo *failed_at;
TupleTableSlot *failed_slot;
/*
@@ -3368,7 +3489,7 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
if (resultRelInfo->ri_PartitionCheck)
ExecPartitionCheck(resultRelInfo, slot, estate);
- result = get_partition_for_tuple(pd, slot, estate,
+ result = get_partition_for_tuple(ptrinfos, slot, estate,
&failed_at, &failed_slot);
if (result < 0)
{
@@ -3378,9 +3499,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
char *val_desc;
ExprContext *ecxt = GetPerTupleExprContext(estate);
- failed_rel = failed_at->reldesc;
+ failed_rel = heap_open(failed_at->relid, NoLock);
ecxt->ecxt_scantuple = failed_slot;
- FormPartitionKeyDatum(failed_at, failed_slot, estate,
+ FormPartitionKeyDatum(failed_at->keyinfo, failed_slot, estate,
key_values, key_isnull);
val_desc = ExecBuildSlotPartitionKeyDescription(failed_rel,
key_values,
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index e12721a9b6..c5deed4685 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -278,7 +278,7 @@ ExecInsert(ModifyTableState *mtstate,
resultRelInfo = estate->es_result_relation_info;
/* Determine the partition to heap_insert the tuple into */
- if (mtstate->mt_partition_dispatch_info)
+ if (mtstate->mt_ptrinfos)
{
int leaf_part_index;
TupleConversionMap *map;
@@ -292,7 +292,7 @@ ExecInsert(ModifyTableState *mtstate,
* respectively.
*/
leaf_part_index = ExecFindPartition(resultRelInfo,
- mtstate->mt_partition_dispatch_info,
+ mtstate->mt_ptrinfos,
slot,
estate);
Assert(leaf_part_index >= 0 &&
@@ -1487,7 +1487,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
int numResultRelInfos;
/* Find the set of partitions so that we can find their TupleDescs. */
- if (mtstate->mt_partition_dispatch_info != NULL)
+ if (mtstate->mt_ptrinfos != NULL)
{
/*
* For INSERT via partitioned table, so we need TupleDescs based
@@ -1911,7 +1911,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
if (operation == CMD_INSERT &&
rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- PartitionDispatch *partition_dispatch_info;
+ PartitionTupleRoutingInfo **ptrinfos;
ResultRelInfo *partitions;
TupleConversionMap **partition_tupconv_maps;
TupleTableSlot *partition_tuple_slot;
@@ -1921,13 +1921,13 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
ExecSetupPartitionTupleRouting(rel,
node->nominalRelation,
estate,
- &partition_dispatch_info,
+ &ptrinfos,
&partitions,
&partition_tupconv_maps,
&partition_tuple_slot,
&num_parted, &num_partitions);
- mtstate->mt_partition_dispatch_info = partition_dispatch_info;
- mtstate->mt_num_dispatch = num_parted;
+ mtstate->mt_ptrinfos = ptrinfos;
+ mtstate->mt_num_parted = num_parted;
mtstate->mt_partitions = partitions;
mtstate->mt_num_partitions = num_partitions;
mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
@@ -2336,21 +2336,22 @@ ExecEndModifyTable(ModifyTableState *node)
resultRelInfo);
}
+ /* Release some resources that we acquired for tuple-routing. */
+
/*
- * Close all the partitioned tables, leaf partitions, and their indices
- *
- * Remember node->mt_partition_dispatch_info[0] corresponds to the root
- * partitioned table, which we must not try to close, because it is the
- * main target table of the query that will be closed by ExecEndPlan().
- * Also, tupslot is NULL for the root partitioned table.
+ * node->mt_ptrinfos[0] corresponds to the root partitioned table, for
+ * which we didn't create tupslot.
*/
- for (i = 1; i < node->mt_num_dispatch; i++)
+ for (i = 1; i < node->mt_num_parted; i++)
{
- PartitionDispatch pd = node->mt_partition_dispatch_info[i];
+ PartitionTupleRoutingInfo *ptrinfo = node->mt_ptrinfos[i];
- heap_close(pd->reldesc, NoLock);
- ExecDropSingleTupleTableSlot(pd->tupslot);
+ ExecDropSingleTupleTableSlot(ptrinfo->tupslot);
}
+
+ /*
+ * Close all the leaf partitions and their indices.
+ */
for (i = 0; i < node->mt_num_partitions; i++)
{
ResultRelInfo *resultRelInfo = node->mt_partitions + i;
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 2283c675e9..7b53baf847 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -39,36 +39,23 @@ typedef struct PartitionDescData
typedef struct PartitionDescData *PartitionDesc;
-/*-----------------------
- * PartitionDispatch - information about one partitioned table in a partition
- * hierarchy required to route a tuple to one of its partitions
- *
- * reldesc Relation descriptor of the table
- * key Partition key information of the table
- * keystate Execution state required for expressions in the partition key
- * partdesc Partition descriptor of the table
- * tupslot A standalone TupleTableSlot initialized with this table's tuple
- * descriptor
- * tupmap TupleConversionMap to convert from the parent's rowtype to
- * this table's rowtype (when extracting the partition key of a
- * tuple just before routing it through this table)
- * indexes Array with partdesc->nparts members (for details on what
- * individual members represent, see how they are set in
- * RelationGetPartitionDispatchInfo())
- *-----------------------
+typedef struct PartitionDispatchData *PartitionDispatch;
+
+/*
+ * Information about one partitioned table in a given partition tree
*/
-typedef struct PartitionDispatchData
+typedef struct PartitionedTableInfo
{
- Relation reldesc;
- PartitionKey key;
- List *keystate; /* list of ExprState */
- PartitionDesc partdesc;
- TupleTableSlot *tupslot;
- TupleConversionMap *tupmap;
- int *indexes;
-} PartitionDispatchData;
+ Oid relid;
+ Oid parentid;
-typedef struct PartitionDispatchData *PartitionDispatch;
+ /*
+ * This contains information about bounds of the partitions of this
+ * table and about where individual partitions are placed in the global
+ * partition tree.
+ */
+ PartitionDispatch pd;
+} PartitionedTableInfo;
extern void RelationBuildPartitionDesc(Relation relation);
extern bool partition_bounds_equal(int partnatts, int16 *parttyplen,
@@ -86,17 +73,18 @@ extern List *map_partition_varattnos(List *expr, int target_varno,
extern List *RelationGetPartitionQual(Relation rel);
extern Expr *get_partition_qual_relid(Oid relid);
+extern void RelationGetPartitionDispatchInfo(Relation rel,
+ List **ptinfos, List **leaf_part_oids);
+
/* For tuple routing */
-extern PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
- int *num_parted, List **leaf_part_oids);
-extern void FormPartitionKeyDatum(PartitionDispatch pd,
+extern void FormPartitionKeyDatum(PartitionKeyInfo *keyinfo,
TupleTableSlot *slot,
EState *estate,
Datum *values,
bool *isnull);
-extern int get_partition_for_tuple(PartitionDispatch *pd,
+extern int get_partition_for_tuple(PartitionTupleRoutingInfo **ptrinfos,
TupleTableSlot *slot,
EState *estate,
- PartitionDispatchData **failed_at,
+ PartitionTupleRoutingInfo **failed_at,
TupleTableSlot **failed_slot);
#endif /* PARTITION_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index eacbea3c36..44b7cd0fd6 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -209,13 +209,13 @@ extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti);
extern void ExecSetupPartitionTupleRouting(Relation rel,
Index resultRTindex,
EState *estate,
- PartitionDispatch **pd,
+ PartitionTupleRoutingInfo ***ptrinfos,
ResultRelInfo **partitions,
TupleConversionMap ***tup_conv_maps,
TupleTableSlot **partition_tuple_slot,
int *num_parted, int *num_partitions);
extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
- PartitionDispatch *pd,
+ PartitionTupleRoutingInfo **ptrinfos,
TupleTableSlot *slot,
EState *estate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3272c4b315..2dcbb139fc 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -414,6 +414,55 @@ typedef struct ResultRelInfo
Relation ri_PartitionRoot;
} ResultRelInfo;
+/* Forward declarations, to avoid including other headers */
+typedef struct PartitionKeyData *PartitionKey;
+typedef struct PartitionDispatchData *PartitionDispatch;
+
+/*
+ * PartitionKeyInfoData - execution state for the partition key of a
+ * partitioned table
+ *
+ * keystate is the execution state required for expressions contained in the
+ * partition key. It is NIL until initialized by FormPartitionKeyDatum() if
+ * and when it is called; for example, during tuple routing through a given
+ * partitioned table.
+ */
+typedef struct PartitionKeyInfo
+{
+ PartitionKey key; /* Points into the table's relcache entry */
+ List *keystate;
+} PartitionKeyInfo;
+
+/*
+ * PartitionTupleRoutingInfo - information required for tuple-routing
+ * through one partitioned table in a partition
+ * tree
+ */
+typedef struct PartitionTupleRoutingInfo
+{
+ /* OID of the table */
+ Oid relid;
+
+ /* Information about the table's partitions */
+ PartitionDispatch pd;
+
+ /* See comment above the definition of PartitionKeyInfo */
+ PartitionKeyInfo *keyinfo;
+
+ /*
+ * A standalone TupleTableSlot initialized with this table's tuple
+ * descriptor
+ */
+ TupleTableSlot *tupslot;
+
+ /*
+ * TupleConversionMap to convert from the parent's rowtype to this table's
+ * rowtype (when extracting the partition key of a tuple just before
+ * routing it through this table)
+ */
+ TupleConversionMap *tupmap;
+} PartitionTupleRoutingInfo;
+
/* ----------------
* EState information
*
@@ -973,9 +1022,9 @@ typedef struct ModifyTableState
TupleTableSlot *mt_existing; /* slot to store existing target tuple in */
List *mt_excludedtlist; /* the excluded pseudo relation's tlist */
TupleTableSlot *mt_conflproj; /* CONFLICT ... SET ... projection target */
- struct PartitionDispatchData **mt_partition_dispatch_info;
/* Tuple-routing support info */
- int mt_num_dispatch; /* Number of entries in the above array */
+ struct PartitionTupleRoutingInfo **mt_ptrinfos;
+ int mt_num_parted; /* Number of entries in the above array */
int mt_num_partitions; /* Number of members in the following
* arrays */
ResultRelInfo *mt_partitions; /* Per partition result relation */
--
2.11.0
Attachment: 0002-Teach-expand_inherited_rtentry-to-use-partition-boun.patch (text/plain; charset=UTF-8)
From df7b8c780fad226b9635b7530018f391acfe7055 Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Wed, 9 Aug 2017 15:52:36 +0900
Subject: [PATCH 2/2] Teach expand_inherited_rtentry to use partition bound
order
After locking the child tables using find_all_inheritors, we discard
the list of child table OIDs that it generates and rebuild the same
using the information returned by RelationGetPartitionDispatchInfo.
---
src/backend/optimizer/prep/prepunion.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index e73c819901..68d0d8efa3 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -33,6 +33,7 @@
#include "access/heapam.h"
#include "access/htup_details.h"
#include "access/sysattr.h"
+#include "catalog/partition.h"
#include "catalog/pg_inherits_fn.h"
#include "catalog/pg_type.h"
#include "miscadmin.h"
@@ -1452,6 +1453,38 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
*/
oldrelation = heap_open(parentOID, NoLock);
+ /*
+ * For partitioned tables, we arrange the child table OIDs such that they
+ * appear in the partition bound order.
+ */
+ if (rte->relkind == RELKIND_PARTITIONED_TABLE)
+ {
+ List *leaf_part_oids,
+ *ptinfos;
+
+ /* Discard the original list. */
+ list_free(inhOIDs);
+ inhOIDs = NIL;
+
+ /* Request partitioning information. */
+ RelationGetPartitionDispatchInfo(oldrelation, &ptinfos,
+ &leaf_part_oids);
+
+ /*
+ * First collect the partitioned child table OIDs, which includes the
+ * root parent at the head.
+ */
+ foreach(l, ptinfos)
+ {
+ PartitionedTableInfo *ptinfo = lfirst(l);
+
+ inhOIDs = lappend_oid(inhOIDs, ptinfo->relid);
+ }
+
+ /* Concatenate the leaf partition OIDs. */
+ inhOIDs = list_concat(inhOIDs, leaf_part_oids);
+ }
+
/* Scan the inheritance set and expand it */
appinfos = NIL;
has_child = false;
--
2.11.0
On 2017/08/21 13:11, Ashutosh Bapat wrote:
On Sat, Aug 19, 2017 at 1:21 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Aug 18, 2017 at 1:17 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
0004 patch in partition-wise join patchset has code to expand
partition hierarchy. That patch is expanding inheritance hierarchy in
depth first manner. Robert commented that instead of depth first
manner, it will be better if we expand it in partitioned tables first
manner. With the latest changes in your patch-set I don't see the
reason for expanding in partitioned tables first order. Can you please
elaborate if we still need to expand in partitioned table first
manner? Maybe we should just address the expansion issue in 0004
instead of dividing it into two patches.
Let me see if I can clarify. I think there are three requirements here:
A. Amit wants to be able to prune leaf partitions before opening and
locking those relations, so that pruning can be done earlier and,
therefore, more cheaply.
We could actually prune partitioned tables, thus pruning the whole
partition tree. Do we want to then lock those partitioned tables but
not the leaves in that tree?
I think it would be nice if we keep the current approach of expanding the
whole partition tree in expand_inherited_rtentry(), at least to know how
many more entries a given partitioned table will add to the query's range
table. It would be nice, because that way, we don't have to worry *right
away* about modifying the planner to cope with some new behavior whereby
range table entries will get added at some later point.
Then, as you might already know, if we want to use the partition bound
order when expanding the whole partition tree, we will depend on the
relcache entries of the partitioned tables in that tree, which will
require us to take locks on them.
It does sound odd that we may end up locking a child *partitioned* table
that is potentially prune-able, but maybe there is some way to relinquish
that lock once we find out that it is pruned after all.
B. Partition-wise join wants to expand the inheritance hierarchy a
level at a time instead of all at once, ending up with rte->inh = true
entries for intermediate partitioned tables.
And create AppendRelInfos which pair children with their partitioned
parent rather than the root.
There should be *some* way to preserve the parent-child RT index mapping
and to preserve the multi-level hierarchy, a way that doesn't map all the
child tables in a partition tree to the root table's RT index.
AppendRelInfo is one way of doing that mapping currently, but if we
continue to treat it as the only way (for the purpose of mapping), we will
be stuck with the way they are created and manipulated. Especially, if we
are going to always depend on the fact that root->append_rel_list contains
all the required AppendRelInfos, then we will always have to fully expand
the inheritance in expand_inherited_rtentry() (by fully I mean, locking
and opening all the child tables, instead of just the partitioned tables).
In a world where we don't want to open the partition child tables in
expand_inherited_rtentry(), we cannot build the corresponding
AppendRelInfos there. Note that this is not about completely dispensing with
AppendRelInfos-for-partition-child-tables, but about doing without them
being present in root->append_rel_list. We would still need them to be
able to use adjust_appendrel_attrs(), etc., but we can create them at a
different time and store them in a place that's not root->append_rel_list;
for example, inside the RelOptInfo of the child table. Or perhaps, we
can still add them to root->append_rel_list, but will need to be careful
about the places that depend on the timing of AppendRelInfos being present
there.
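To illustrate what I mean by a per-level mapping, here is a minimal standalone sketch, not code from any patch. ToyAppendRel, toy_parent_of() and toy_root_of() are hypothetical names made up for this example; the only thing borrowed from the planner is the idea that each entry pairs a child RT index with its immediate parent's RT index (valid RT indexes start at 1, so 0 is used here to mean "no parent").

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for an AppendRelInfo: pairs a child with its immediate parent. */
typedef struct ToyAppendRel
{
    int         parent_relid;   /* RT index of the immediate parent */
    int         child_relid;    /* RT index of the child */
} ToyAppendRel;

/* Return the immediate parent of relid, or 0 if the list has no entry for it. */
static int
toy_parent_of(const ToyAppendRel *list, int n, int relid)
{
    int         i;

    assert(list != NULL || n == 0);
    for (i = 0; i < n; i++)
    {
        if (list[i].child_relid == relid)
            return list[i].parent_relid;
    }
    return 0;
}

/* Walk the per-level mapping upward until the topmost ancestor is reached. */
static int
toy_root_of(const ToyAppendRel *list, int n, int relid)
{
    int         parent;

    while ((parent = toy_parent_of(list, n, relid)) != 0)
        relid = parent;
    return relid;
}
```

With entries (1,2), (1,3), (3,4), (3,5), the child at RT index 4 maps to its immediate parent 3, and the root is still reachable by walking up. Where such entries are stored (root->append_rel_list or somewhere else) does not matter to the lookup itself.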
So in the end game I think
expand_inherited_rtentry looks approximately like this:
1. Calling find_all_inheritors with a new only-lock-the-partitions
flag. This should result in locking all partitioned tables in the
inheritance hierarchy in breadth-first, low-OID-first order. (When
the only-lock-the-partitions isn't specified, all partitioned tables
should still be locked before any unpartitioned tables, so that the
locking order in that case is consistent with what we do here.)
I am confused. When "only-lock-the-partitions" is true, do we expect
intermediate partitioned tables to be locked? Why then "only" in the
flag?
I guess Robert meant to say lock-only-"partitioned"-tables?
2. Iterate over the partitioned tables identified in step 1 in the
order in which they were returned. For each one:
- Decide which children can be pruned.
- Lock the unpruned, non-partitioned children in low-OID-first order.
3. Make another pass over the inheritance hierarchy, starting at the
root. Traverse the whole hierarchy in breadth-first in *bound* order.
Add RTEs and AppendRelInfos as we go -- these will have rte->inh =
true for partitioned tables and rte->inh = false for leaf partitions.
These two seem to be based on the assumption that we have to lock all
the partitioned tables even if they can be pruned.
For regular inheritance there is only a single parent, so traversing
the list returned by find_all_inheritors suffices. For partitioned
hierarchy, we need to know the parent of every child, which is not
part of the find_all_inheritors() output list. Even if it returns only
the partitioned children, they themselves may have a parent different
from the root partition. So, we have to discard the output of
find_all_inheritors() for partitioned hierarchy and traverse the
children as per their order in the oids array in PartitionDesc. Maybe
it's better to separate the guts of expand_inherited_rtentry(), which
create AppendRelInfos, RTEs and rowmarks for the children, into a
separate routine. Use that routine in two different functions,
expand_inherited_rtentry() and expand_partitioned_rtentry(), for
regular inheritance and partitioned inheritance respectively. The functions
will use two different traversal methods appropriate for traversing
the children in either case.
I just posted a patch [1] that implements something like this, although
the implementation details differ. It does not, however, solve the
problem you raise, namely that partitioned child tables that could be
pruned still get locked.
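For readers following along, the ordering that results can be sketched as below. This is a toy, standalone illustration under stated assumptions and not the patch's code: ToyRel, toy_expand() and TOY_MAX_RELS are hypothetical, each table's children are assumed to be stored already sorted by partition bound, and a table with no children is simply treated as a leaf.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_RELS 32

/* Toy catalog entry: children are stored already sorted by partition bound. */
typedef struct ToyRel
{
    unsigned int oid;
    int         nchildren;      /* 0 means leaf partition (toy simplification) */
    const int  *children;       /* positions in the catalog array, bound order */
} ToyRel;

/*
 * Expand the tree rooted at catalog[0] breadth-first, visiting each table's
 * children in bound order.  OIDs of partitioned tables go to parted[] and
 * OIDs of leaf partitions go to leaves[].  Returns the number of partitioned
 * tables and sets *nleaves.
 */
static int
toy_expand(const ToyRel *catalog, unsigned int *parted,
           unsigned int *leaves, int *nleaves)
{
    int         queue[TOY_MAX_RELS];
    int         head = 0;
    int         tail = 0;
    int         nparted = 0;

    assert(catalog != NULL && catalog[0].nchildren > 0);
    *nleaves = 0;
    queue[tail++] = 0;

    while (head < tail)
    {
        const ToyRel *rel = &catalog[queue[head++]];
        int         j;

        parted[nparted++] = rel->oid;
        for (j = 0; j < rel->nchildren; j++)
        {
            int         child = rel->children[j];

            if (catalog[child].nchildren > 0)
            {
                /* Partitioned child: process it after the current level. */
                assert(tail < TOY_MAX_RELS);
                queue[tail++] = child;
            }
            else
                leaves[(*nleaves)++] = catalog[child].oid;
        }
    }
    return nparted;
}
```

Appending leaves[] to parted[] gives a list with the root at the head, then the other partitioned tables, then the leaf partitions in bound order, which is the shape of the inhOIDs list that 0002 rebuilds. Note that the result is no longer in ascending OID order.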
Thanks,
Amit
[1]: /messages/by-id/098b9c71-1915-1a2a-8d52-1a7a50ce79e8@lab.ntt.co.jp
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Aug 21, 2017 at 2:10 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
[ new patches ]
I am failing to understand the point of separating PartitionDispatch
into PartitionDispatch and PartitionedTableInfo. That seems like an
unnecessary multiplication of entities, as does the introduction of
PartitionKeyInfo. I also think that replacing reldesc with relid is
not really an improvement; any place that gets the relid has to go
open the relation to get the reldesc, whereas without that it has a
direct pointer to the information it needs.
I suggest that this patch just focus on removing the following things
from PartitionDispatchData: keystate, tupslot, tupmap. Those things
are clearly executor-specific stuff that makes sense to move to a
different structure, what you're calling PartitionTupleRoutingInfo
(not sure that's the best name). The other stuff all seems fine.
You're going to have to open the relation anyway, so keeping the
reldesc around seems like an optimization, if anything. The
PartitionKey and PartitionDesc pointers may not really be needed --
they're just pointers into reldesc -- but they're trivial to compute,
so it doesn't hurt anything to have them either as a
micro-optimization for performance or even just for readability.
That just leaves indexes. In a world where keystate, tupslot, and
tupmap are removed from the PartitionDispatchData, you must need
indexes or there would be no point in constructing a
PartitionDispatchData object in the first place; any application that
needs neither indexes nor the executor-specific stuff could just use
the Relation directly.
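As a reminder of what indexes encodes, a small standalone sketch follows. It assumes the convention described in the comments of RelationGetPartitionDispatchInfo(): a value >= 0 is a position in the leaf partition list, and a negative value, once negated, is the position of a partitioned table's dispatch object. DispatchNode, route_to_leaf() and the example tree are hypothetical; in the real code the partition position at each level comes from a bound search on the tuple's partition key, for which the choices array stands in here.

```c
#include <assert.h>

/* Hypothetical, stripped-down dispatch object: only the indexes array. */
typedef struct DispatchNode
{
    int        nparts;   /* number of partitions of this table */
    const int *indexes;  /* >= 0: leaf index; < 0: minus the node's position */
} DispatchNode;

/*
 * Follow 'choices' (the partition position picked at each level) starting
 * from the root, nodes[0], until a leaf is reached; return the leaf's index.
 */
static int
route_to_leaf(const DispatchNode *nodes, const int *choices)
{
    const DispatchNode *parent = &nodes[0];
    int         level = 0;

    for (;;)
    {
        int     cur = choices[level++];
        int     idx;

        assert(cur >= 0 && cur < parent->nparts);
        idx = parent->indexes[cur];
        if (idx >= 0)
            return idx;         /* found a leaf partition */
        parent = &nodes[-idx];  /* descend into the partitioned child */
    }
}

/*
 * Example tree: the root has partitions (leaf A, partitioned P1, leaf B) and
 * P1 has partitions (leaf C, leaf D).  In breadth-first order the leaves are
 * A=0, B=1, C=2, D=3, and the partitioned tables are root=0, P1=1.
 */
static const int root_indexes[] = {0, -1, 1};
static const int p1_indexes[] = {2, 3};
static const DispatchNode example_tree[] = {
    {3, root_indexes},
    {2, p1_indexes},
};
```

For instance, picking position 1 at the root and then position 1 under P1 lands on leaf D, whose index is 3.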
Regarding your XXX comments, note that if you've got a lock on a
relation, the pointers to the PartitionKey and PartitionDesc are
stable. The PartitionKey can't change once it's established, and the
PartitionDesc can't change while we've got a lock on the relation
unless we change it ourselves (and any place that does should have
CheckTableNotInUse checks). The keep_partkey and keep_partdesc
handling in relcache.c exists exactly so that we can guarantee that
the pointer won't go stale under us. Now, if we *don't* have a lock
on the relation, then those pointers can easily be invalidated -- so
you can't hang onto a PartitionDispatch for longer than you hang onto
the lock on the Relation. But that shouldn't be a problem. I think
you only need to hang onto PartitionDispatch pointers for the lifetime
of a single query. One can imagine optimizations where we try to
avoid rebuilding that for subsequent queries but I'm not sure there's
any demonstrated need for such a system at present.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017/08/26 3:28, Robert Haas wrote:
On Mon, Aug 21, 2017 at 2:10 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
[ new patches ]
I am failing to understand the point of separating PartitionDispatch
into PartitionDispatch and PartitionTableInfo. That seems like an
unnecessary multiplication of entities, as does the introduction of
PartitionKeyInfo. I also think that replacing reldesc with reloid is
not really an improvement; any place that gets the relid has to go
open the relation to get the reldesc, whereas without that it has a
direct pointer to the information it needs.
I am worried about the open relcache reference in PartitionDispatch when
we start using it in the planner. Whereas the executor has
ExecEndModifyTable() as a suitable place to close that reference, there
doesn't seem to be an equivalent place within the planner, but I guess we
will have to figure something out. For the time being, the second patch
closes the reference in expand_inherited_rtentry() right after picking up
the OID using RelationGetRelid(pd->reldesc).
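The pattern described here (take the OID, then drop the reference right away) can be illustrated with a toy model. This is only a sketch under stated assumptions: MockRelation, mock_open(), mock_close() and collect_oids_and_close() are hypothetical stand-ins for the relcache entry, heap_open(), heap_close() and the planner-side loop, and the reference count is a simplification of what the relcache actually tracks.

```c
#include <assert.h>

typedef unsigned int Oid;

/* Mock relation descriptor: only an OID and a reference count. */
typedef struct MockRelation
{
    Oid     relid;
    int     refcnt;
} MockRelation;

/* Stand-in for heap_open(): take a reference on the descriptor. */
static MockRelation *
mock_open(MockRelation *rel)
{
    rel->refcnt++;
    return rel;
}

/* Stand-in for heap_close(): release a previously taken reference. */
static void
mock_close(MockRelation *rel)
{
    assert(rel->refcnt > 0);
    rel->refcnt--;
}

/*
 * For each already-opened relation, remember its OID and release the
 * reference immediately, so that nothing is left open afterwards.
 */
static void
collect_oids_and_close(MockRelation **opened, int n, Oid *oids_out)
{
    int     i;

    for (i = 0; i < n; i++)
    {
        oids_out[i] = opened[i]->relid;
        mock_close(opened[i]);
    }
}
```

The point of the model is only that every open is paired with a close inside the same function, so no cleanup hook is needed later.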
I suggest that this patch just focus on removing the following things
from PartitionDispatchData: keystate, tupslot, tupmap. Those things
are clearly executor-specific stuff that makes sense to move to a
different structure, what you're calling PartitionTupleRoutingInfo
(not sure that's the best name). The other stuff all seems fine.
You're going to have to open the relation anyway, so keeping the
reldesc around seems like an optimization, if anything. The
PartitionKey and PartitionDesc pointers may not really be needed --
they're just pointers into reldesc -- but they're trivial to compute,
so it doesn't hurt anything to have them either as a
micro-optimization for performance or even just for readability.
OK, done this way in the attached updated patch. Any suggestions about a
better name for what the patch calls PartitionTupleRoutingInfo?
That just leaves indexes. In a world where keystate, tupslot, and
tupmap are removed from the PartitionDispatchData, you must need
indexes or there would be no point in constructing a
PartitionDispatchData object in the first place; any application that
needs neither indexes nor the executor-specific stuff could just use
the Relation directly.
Agreed.
Regarding your XXX comments, note that if you've got a lock on a
relation, the pointers to the PartitionKey and PartitionDesc are
stable. The PartitionKey can't change once it's established, and the
PartitionDesc can't change while we've got a lock on the relation
unless we change it ourselves (and any place that does should have
CheckTableNotInUse checks). The keep_partkey and keep_partdesc
handling in relcache.c exists exactly so that we can guarantee that
the pointer won't go stale under us. Now, if we *don't* have a lock
on the relation, then those pointers can easily be invalidated -- so
you can't hang onto a PartitionDispatch for longer than you hang onto
the lock on the Relation. But that shouldn't be a problem. I think
you only need to hang onto PartitionDispatch pointers for the lifetime
of a single query. One can imagine optimizations where we try to
avoid rebuilding that for subsequent queries but I'm not sure there's
any demonstrated need for such a system at present.
Here too.
Attached are the updated patches.
Thanks,
Amit
Attachments:
0001-Decouple-RelationGetPartitionDispatchInfo-from-execu.patch (text/plain; charset=UTF-8)
From fb4bd4818c4faa08b3c4d37709f01dc55f256a46 Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Mon, 24 Jul 2017 18:59:57 +0900
Subject: [PATCH 1/2] Decouple RelationGetPartitionDispatchInfo() from executor
Currently, it and the structure it generates, viz. PartitionDispatch
objects, are too tightly coupled with the executor's tuple-routing code.
In particular, it's undesirable that it makes the caller responsible for
releasing some resources, such as executor tuple table slots,
tuple-conversion maps, etc. That makes it harder to use in places other
than where it's currently being used.
After this refactoring, ExecSetupPartitionTupleRouting() now needs to
do some of the work that was previously done in
RelationGetPartitionDispatchInfo().
---
src/backend/catalog/partition.c | 278 +++++++++++++++------------------
src/backend/commands/copy.c | 37 +++--
src/backend/executor/execMain.c | 124 +++++++++++++--
src/backend/executor/nodeModifyTable.c | 37 +++--
src/include/catalog/partition.h | 34 ++--
src/include/executor/executor.h | 4 +-
src/include/nodes/execnodes.h | 40 ++++-
7 files changed, 326 insertions(+), 228 deletions(-)
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 96a64ce6b2..25fc4583de 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -981,181 +981,147 @@ get_partition_qual_relid(Oid relid)
}
/*
- * Append OIDs of rel's partitions to the list 'partoids' and for each OID,
- * append pointer rel to the list 'parents'.
- */
-#define APPEND_REL_PARTITION_OIDS(rel, partoids, parents) \
- do\
- {\
- int i;\
- for (i = 0; i < (rel)->rd_partdesc->nparts; i++)\
- {\
- (partoids) = lappend_oid((partoids), (rel)->rd_partdesc->oids[i]);\
- (parents) = lappend((parents), (rel));\
- }\
- } while(0)
-
-/*
* RelationGetPartitionDispatchInfo
- * Returns information necessary to route tuples down a partition tree
+ * Returns necessary information for each partition in the partition
+ * tree rooted at rel
*
- * The number of elements in the returned array (that is, the number of
- * PartitionDispatch objects for the partitioned tables in the partition tree)
- * is returned in *num_parted and a list of the OIDs of all the leaf
- * partitions of rel is returned in *leaf_part_oids.
+ * A list of PartitionDispatch objects is returned, which contains one object
+ * for each partitioned table in the partition tree (with at least one member,
+ * that is, the one for the root partitioned table). Also, upon return,
+ * *leaf_part_oids will contain a list of the OIDs of all the leaf partitions
+ * in the partition tree.
*
- * All the relations in the partition tree (including 'rel') must have been
- * locked (using at least the AccessShareLock) by the caller.
+ * We require that the caller has locked at least the partitioned tables in
+ * the partition tree (including 'rel') using at least the AccessShareLock,
+ * because we need to look at their relcache entries to examine their
+ * PartitionDescs.
+ *
+ * It's the responsibility of the caller to close the relation descriptor
+ * reference contained in each PartitionDispatch object.
*/
-PartitionDispatch *
-RelationGetPartitionDispatchInfo(Relation rel,
- int *num_parted, List **leaf_part_oids)
+List *
+RelationGetPartitionDispatchInfo(Relation rel, List **leaf_part_oids)
{
- PartitionDispatchData **pd;
- List *all_parts = NIL,
- *all_parents = NIL,
- *parted_rels,
- *parted_rel_parents;
+ List *result = NIL,
+ *all_parts,
+ *all_parents;
ListCell *lc1,
*lc2;
int i,
- k,
offset;
/*
* We rely on the relcache to traverse the partition tree to build both
- * the leaf partition OIDs list and the array of PartitionDispatch objects
- * for the partitioned tables in the tree. That means every partitioned
- * table in the tree must be locked, which is fine since we require the
- * caller to lock all the partitions anyway.
+ * the leaf partition OIDs list and the list of PartitionDispatch objects
+ * for the partitioned tables. That means every partitioned table in the
+ * tree must be locked, which is fine since the callers must have done
+ * that already.
*
* For every partitioned table in the tree, starting with the root
* partitioned table, add its relcache entry to parted_rels, while also
* queuing its partitions (in the order in which they appear in the
* partition descriptor) to be looked at later in the same loop. This is
* a bit tricky but works because the foreach() macro doesn't fetch the
- * next list element until the bottom of the loop.
+ * next list element until the bottom of the loop. Non-partitioned tables
+ * are simply added to the leaf partitions list.
*/
- *num_parted = 1;
- parted_rels = list_make1(rel);
- /* Root partitioned table has no parent, so NULL for parent */
- parted_rel_parents = list_make1(NULL);
- APPEND_REL_PARTITION_OIDS(rel, all_parts, all_parents);
+ i = offset = 0;
+ *leaf_part_oids = NIL;
+
+ /* Start with the root table. */
+ all_parts = list_make1_oid(RelationGetRelid(rel));
+ all_parents = list_make1_oid(InvalidOid);
forboth(lc1, all_parts, lc2, all_parents)
{
- Oid partrelid = lfirst_oid(lc1);
- Relation parent = lfirst(lc2);
+ Oid partrelid = lfirst_oid(lc1);
+ Oid parentrelid = lfirst_oid(lc2);
if (get_rel_relkind(partrelid) == RELKIND_PARTITIONED_TABLE)
{
- /*
- * Already locked by the caller. Note that it is the
- * responsibility of the caller to close the below relcache entry,
- * once done using the information being collected here (for
- * example, in ExecEndModifyTable).
- */
- Relation partrel = heap_open(partrelid, NoLock);
+ int j,
+ k;
+ Relation partrel;
+ PartitionDesc partdesc;
+ PartitionDispatch pd;
+
+ if (partrelid != RelationGetRelid(rel))
+ partrel = heap_open(partrelid, NoLock);
+ else
+ partrel = rel;
- (*num_parted)++;
- parted_rels = lappend(parted_rels, partrel);
- parted_rel_parents = lappend(parted_rel_parents, parent);
- APPEND_REL_PARTITION_OIDS(partrel, all_parts, all_parents);
- }
- }
+ partdesc = RelationGetPartitionDesc(partrel);
+
+ pd = (PartitionDispatchData *)
+ palloc0(sizeof(PartitionDispatchData));
+ pd->reldesc = partrel;
+ pd->parentoid = parentrelid;
- /*
- * We want to create two arrays - one for leaf partitions and another for
- * partitioned tables (including the root table and internal partitions).
- * While we only create the latter here, leaf partition array of suitable
- * objects (such as, ResultRelInfo) is created by the caller using the
- * list of OIDs we return. Indexes into these arrays get assigned in a
- * breadth-first manner, whereby partitions of any given level are placed
- * consecutively in the respective arrays.
- */
- pd = (PartitionDispatchData **) palloc(*num_parted *
- sizeof(PartitionDispatchData *));
- *leaf_part_oids = NIL;
- i = k = offset = 0;
- forboth(lc1, parted_rels, lc2, parted_rel_parents)
- {
- Relation partrel = lfirst(lc1);
- Relation parent = lfirst(lc2);
- PartitionKey partkey = RelationGetPartitionKey(partrel);
- TupleDesc tupdesc = RelationGetDescr(partrel);
- PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
- int j,
- m;
-
- pd[i] = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
- pd[i]->reldesc = partrel;
- pd[i]->key = partkey;
- pd[i]->keystate = NIL;
- pd[i]->partdesc = partdesc;
- if (parent != NULL)
- {
/*
- * For every partitioned table other than root, we must store a
- * tuple table slot initialized with its tuple descriptor and a
- * tuple conversion map to convert a tuple from its parent's
- * rowtype to its own. That is to make sure that we are looking at
- * the correct row using the correct tuple descriptor when
- * computing its partition key for tuple routing.
+ * The values contained in the following array correspond to
+ * indexes of this table's partitions in the global sequence of
+ * all the partitions contained in the partition tree rooted at
+ * rel, traversed in a breadth-first manner. The values should be
+ * such that we will be able to distinguish the leaf partitions
+ * from the non-leaf partitions, because they are returned to
+ * the caller in separate structures from where they will be
+ * accessed. The way that's done is described below:
+ *
+ * Leaf partition OIDs are put into the global leaf_part_oids list,
+ * and for each one, the value stored is its ordinal position in
+ * the list minus 1.
+ *
+ * PartitionDispatch objects corresponding to partitions that
+ * are partitioned tables are put into the global result list,
+ * and for each one, the value stored is its zero-based position
+ * in the list multiplied by -1.
+ *
+ * So, when examining the values in the indexes array, getting a
+ * value >= 0 means the corresponding partition is a leaf
+ * partition. Otherwise, it's a partitioned table.
*/
- pd[i]->tupslot = MakeSingleTupleTableSlot(tupdesc);
- pd[i]->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
- tupdesc,
- gettext_noop("could not convert row type"));
- }
- else
- {
- /* Not required for the root partitioned table */
- pd[i]->tupslot = NULL;
- pd[i]->tupmap = NULL;
- }
- pd[i]->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
-
- /*
- * Indexes corresponding to the internal partitions are multiplied by
- * -1 to distinguish them from those of leaf partitions. Encountering
- * an index >= 0 means we found a leaf partition, which is immediately
- * returned as the partition we are looking for. A negative index
- * means we found a partitioned table, whose PartitionDispatch object
- * is located at the above index multiplied back by -1. Using the
- * PartitionDispatch object, search is continued further down the
- * partition tree.
- */
- m = 0;
- for (j = 0; j < partdesc->nparts; j++)
- {
- Oid partrelid = partdesc->oids[j];
+ pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
- if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
- {
- *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
- pd[i]->indexes[j] = k++;
- }
- else
+ k = 0;
+ for (j = 0; j < partdesc->nparts; j++)
{
+ Oid partrelid = partdesc->oids[j];
+
/*
- * offset denotes the number of partitioned tables of upper
- * levels including those of the current level. Any partition
- * of this table must belong to the next level and hence will
- * be placed after the last partitioned table of this level.
+ * Queue this partition so that it will be processed later
+ * by the outer loop.
*/
- pd[i]->indexes[j] = -(1 + offset + m);
- m++;
+ all_parts = lappend_oid(all_parts, partrelid);
+ all_parents = lappend_oid(all_parents,
+ RelationGetRelid(partrel));
+
+ if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
+ {
+ *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
+ pd->indexes[j] = i++;
+ }
+ else
+ {
+ /*
+ * offset denotes the number of partitioned tables that
+ * we have already processed. k counts the number of
+ * partitions of this table that were found to be
+ * partitioned tables.
+ */
+ pd->indexes[j] = -(1 + offset + k);
+ k++;
+ }
}
- }
- i++;
- /*
- * This counts the number of partitioned tables at upper levels
- * including those of the current level.
- */
- offset += m;
+ offset += k;
+
+ result = lappend(result, pd);
+ }
}
- return pd;
+ Assert(i == list_length(*leaf_part_oids));
+ Assert((offset + 1) == list_length(result));
+
+ return result;
}
/* Module-local functions */
@@ -1860,7 +1826,7 @@ generate_partition_qual(Relation rel)
* Construct values[] and isnull[] arrays for the partition key
* of a tuple.
*
- * pd Partition dispatch object of the partitioned table
+ * ptrinfo PartitionTupleRoutingInfo object of the table
* slot Heap tuple from which to extract partition key
* estate executor state for evaluating any partition key
* expressions (must be non-NULL)
@@ -1872,29 +1838,30 @@ generate_partition_qual(Relation rel)
* ----------------
*/
void
-FormPartitionKeyDatum(PartitionDispatch pd,
+FormPartitionKeyDatum(PartitionTupleRoutingInfo *ptrinfo,
TupleTableSlot *slot,
EState *estate,
Datum *values,
bool *isnull)
{
+ PartitionKey key = RelationGetPartitionKey(ptrinfo->pd->reldesc);
ListCell *partexpr_item;
int i;
- if (pd->key->partexprs != NIL && pd->keystate == NIL)
+ if (key->partexprs != NIL && ptrinfo->keystate == NIL)
{
/* Check caller has set up context correctly */
Assert(estate != NULL &&
GetPerTupleExprContext(estate)->ecxt_scantuple == slot);
/* First time through, set up expression evaluation state */
- pd->keystate = ExecPrepareExprList(pd->key->partexprs, estate);
+ ptrinfo->keystate = ExecPrepareExprList(key->partexprs, estate);
}
- partexpr_item = list_head(pd->keystate);
- for (i = 0; i < pd->key->partnatts; i++)
+ partexpr_item = list_head(ptrinfo->keystate);
+ for (i = 0; i < key->partnatts; i++)
{
- AttrNumber keycol = pd->key->partattrs[i];
+ AttrNumber keycol = key->partattrs[i];
Datum datum;
bool isNull;
@@ -1931,13 +1898,13 @@ FormPartitionKeyDatum(PartitionDispatch pd,
* the latter case.
*/
int
-get_partition_for_tuple(PartitionDispatch *pd,
+get_partition_for_tuple(PartitionTupleRoutingInfo **ptrinfos,
TupleTableSlot *slot,
EState *estate,
- PartitionDispatchData **failed_at,
+ PartitionTupleRoutingInfo **failed_at,
TupleTableSlot **failed_slot)
{
- PartitionDispatch parent;
+ PartitionTupleRoutingInfo *parent;
Datum values[PARTITION_MAX_KEYS];
bool isnull[PARTITION_MAX_KEYS];
int cur_offset,
@@ -1948,11 +1915,12 @@ get_partition_for_tuple(PartitionDispatch *pd,
TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
/* start with the root partitioned table */
- parent = pd[0];
+ parent = ptrinfos[0];
while (true)
{
- PartitionKey key = parent->key;
- PartitionDesc partdesc = parent->partdesc;
+ PartitionDispatch pd = parent->pd;
+ PartitionKey key = RelationGetPartitionKey(pd->reldesc);
+ PartitionDesc partdesc = RelationGetPartitionDesc(pd->reldesc);
TupleTableSlot *myslot = parent->tupslot;
TupleConversionMap *map = parent->tupmap;
@@ -2055,13 +2023,13 @@ get_partition_for_tuple(PartitionDispatch *pd,
*failed_slot = slot;
break;
}
- else if (parent->indexes[cur_index] >= 0)
+ else if (pd->indexes[cur_index] >= 0)
{
- result = parent->indexes[cur_index];
+ result = pd->indexes[cur_index];
break;
}
else
- parent = pd[-parent->indexes[cur_index]];
+ parent = ptrinfos[-pd->indexes[cur_index]];
}
error_exit:
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfa3f059c2..b0c596345b 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -165,8 +165,8 @@ typedef struct CopyStateData
bool volatile_defexprs; /* is any of defexprs volatile? */
List *range_table;
- PartitionDispatch *partition_dispatch_info;
- int num_dispatch; /* Number of entries in the above array */
+ PartitionTupleRoutingInfo **ptrinfos;
+ int num_parted; /* Number of entries in the above array */
int num_partitions; /* Number of members in the following arrays */
ResultRelInfo *partitions; /* Per partition result relation */
TupleConversionMap **partition_tupconv_maps;
@@ -2445,7 +2445,7 @@ CopyFrom(CopyState cstate)
*/
if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- PartitionDispatch *partition_dispatch_info;
+ PartitionTupleRoutingInfo **ptrinfos;
ResultRelInfo *partitions;
TupleConversionMap **partition_tupconv_maps;
TupleTableSlot *partition_tuple_slot;
@@ -2455,13 +2455,13 @@ CopyFrom(CopyState cstate)
ExecSetupPartitionTupleRouting(cstate->rel,
1,
estate,
- &partition_dispatch_info,
+ &ptrinfos,
&partitions,
&partition_tupconv_maps,
&partition_tuple_slot,
&num_parted, &num_partitions);
- cstate->partition_dispatch_info = partition_dispatch_info;
- cstate->num_dispatch = num_parted;
+ cstate->ptrinfos = ptrinfos;
+ cstate->num_parted = num_parted;
cstate->partitions = partitions;
cstate->num_partitions = num_partitions;
cstate->partition_tupconv_maps = partition_tupconv_maps;
@@ -2502,7 +2502,7 @@ CopyFrom(CopyState cstate)
if ((resultRelInfo->ri_TrigDesc != NULL &&
(resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
- cstate->partition_dispatch_info != NULL ||
+ cstate->ptrinfos != NULL ||
cstate->volatile_defexprs)
{
useHeapMultiInsert = false;
@@ -2580,7 +2580,7 @@ CopyFrom(CopyState cstate)
ExecStoreTuple(tuple, slot, InvalidBuffer, false);
/* Determine the partition to heap_insert the tuple into */
- if (cstate->partition_dispatch_info)
+ if (cstate->ptrinfos)
{
int leaf_part_index;
TupleConversionMap *map;
@@ -2594,7 +2594,7 @@ CopyFrom(CopyState cstate)
* partition, respectively.
*/
leaf_part_index = ExecFindPartition(resultRelInfo,
- cstate->partition_dispatch_info,
+ cstate->ptrinfos,
slot,
estate);
Assert(leaf_part_index >= 0 &&
@@ -2826,24 +2826,23 @@ CopyFrom(CopyState cstate)
ExecCloseIndices(resultRelInfo);
- /* Close all the partitioned tables, leaf partitions, and their indices */
- if (cstate->partition_dispatch_info)
+ /* Release some resources that we acquired for tuple-routing. */
+ if (cstate->ptrinfos)
{
int i;
/*
- * Remember cstate->partition_dispatch_info[0] corresponds to the root
- * partitioned table, which we must not try to close, because it is
- * the main target table of COPY that will be closed eventually by
- * DoCopy(). Also, tupslot is NULL for the root partitioned table.
+ * cstate->ptrinfos[0] corresponds to the root partitioned table, for
+ * which we didn't create tupslot.
*/
- for (i = 1; i < cstate->num_dispatch; i++)
+ for (i = 1; i < cstate->num_parted; i++)
{
- PartitionDispatch pd = cstate->partition_dispatch_info[i];
+ PartitionTupleRoutingInfo *ptrinfo = cstate->ptrinfos[i];
- heap_close(pd->reldesc, NoLock);
- ExecDropSingleTupleTableSlot(pd->tupslot);
+ ExecDropSingleTupleTableSlot(ptrinfo->tupslot);
}
+
+ /* Close all the leaf partitions and their indices */
for (i = 0; i < cstate->num_partitions; i++)
{
ResultRelInfo *resultRelInfo = cstate->partitions + i;
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 2946a0edee..493ade0e78 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -3236,8 +3236,8 @@ EvalPlanQualEnd(EPQState *epqstate)
* tuple routing for partitioned tables
*
* Output arguments:
- * 'pd' receives an array of PartitionDispatch objects with one entry for
- * every partitioned table in the partition tree
+ * 'ptrinfos' receives an array of PartitionTupleRoutingInfo objects with one
+ * entry for each partitioned table in the partition tree
* 'partitions' receives an array of ResultRelInfo objects with one entry for
* every leaf partition in the partition tree
* 'tup_conv_maps' receives an array of TupleConversionMap objects with one
@@ -3260,7 +3260,7 @@ void
ExecSetupPartitionTupleRouting(Relation rel,
Index resultRTindex,
EState *estate,
- PartitionDispatch **pd,
+ PartitionTupleRoutingInfo ***ptrinfos,
ResultRelInfo **partitions,
TupleConversionMap ***tup_conv_maps,
TupleTableSlot **partition_tuple_slot,
@@ -3268,16 +3268,116 @@ ExecSetupPartitionTupleRouting(Relation rel,
{
TupleDesc tupDesc = RelationGetDescr(rel);
List *leaf_parts;
+ List *pdlist;
ListCell *cell;
int i;
ResultRelInfo *leaf_part_rri;
+ Relation parent;
/*
* Get the information about the partition tree after locking all the
* partitions.
*/
(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
- *pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
+ pdlist = RelationGetPartitionDispatchInfo(rel, &leaf_parts);
+
+ /*
+ * The pdlist list contains PartitionDispatch objects for all the
+ * partitioned tables in the partition tree. Using the information
+ * therein, we construct an array of PartitionTupleRoutingInfo objects
+ * to be used during tuple-routing.
+ */
+ *num_parted = list_length(pdlist);
+ *ptrinfos = (PartitionTupleRoutingInfo **) palloc0(*num_parted *
+ sizeof(PartitionTupleRoutingInfo *));
+ /*
+ * Free the pdlist List structure itself as we go through (open-coded
+ * list_free).
+ */
+ i = 0;
+ cell = list_head(pdlist);
+ parent = NULL;
+ while (cell)
+ {
+ ListCell *tmp = cell;
+ PartitionDispatch pd = lfirst(tmp),
+ next_pd = NULL;
+ Relation partrel;
+ PartitionTupleRoutingInfo *ptrinfo;
+
+ if (lnext(tmp))
+ next_pd = lfirst(lnext(tmp));
+
+ partrel = pd->reldesc;
+
+ ptrinfo = (PartitionTupleRoutingInfo *)
+ palloc0(sizeof(PartitionTupleRoutingInfo));
+ /* Stash a reference to this PartitionDispatch. */
+ ptrinfo->pd = pd;
+
+ /* State for extracting partition key from tuples will go here. */
+ ptrinfo->keystate = NIL;
+
+ /*
+ * For every partitioned table other than root, we must store a tuple
+ * table slot initialized with its tuple descriptor and a tuple
+ * conversion map to convert a tuple from its parent's rowtype to its
+ * own. That is to make sure that we are looking at the correct row
+ * using the correct tuple descriptor when computing its partition key
+ * for tuple routing.
+ */
+ if (pd->parentoid != InvalidOid)
+ {
+ TupleDesc tupdesc = RelationGetDescr(partrel);
+
+ /* Open the parent relation descriptor if not already done. */
+ if (pd->parentoid == RelationGetRelid(rel))
+ {
+ parent = rel;
+ }
+ else if (parent == NULL)
+ {
+ /* Already locked by find_all_inheritors() above. */
+ parent = heap_open(pd->parentoid, NoLock);
+ }
+
+ ptrinfo->tupslot = MakeSingleTupleTableSlot(tupdesc);
+ ptrinfo->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
+ tupdesc,
+ gettext_noop("could not convert row type"));
+
+ /*
+ * Close the parent descriptor, if the next partitioned table in
+ * the list is not a sibling, because it will have a different
+ * parent if so.
+ */
+ if (parent != NULL && parent != rel &&
+ next_pd != NULL &&
+ next_pd->parentoid != pd->parentoid)
+ {
+ heap_close(parent, NoLock);
+ parent = NULL;
+ }
+ }
+ else
+ {
+ /* Not required for the root partitioned table */
+ ptrinfo->tupslot = NULL;
+ ptrinfo->tupmap = NULL;
+ }
+
+ (*ptrinfos)[i++] = ptrinfo;
+
+ /* Free the ListCell. */
+ cell = lnext(cell);
+ pfree(tmp);
+ }
+
+ /* Free the List itself. */
+ if (pdlist)
+ pfree(pdlist);
+
+ /* For leaf partitions, we build ResultRelInfos and TupleConversionMaps. */
*num_partitions = list_length(leaf_parts);
*partitions = (ResultRelInfo *) palloc(*num_partitions *
sizeof(ResultRelInfo));
@@ -3304,7 +3404,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
* Note that each of the relations in *partitions are eventually
* closed by the caller.
*/
- partrel = heap_open(lfirst_oid(cell), NoLock);
+ partrel = heap_open(lfirst_oid(cell), RowExclusiveLock);
part_tupdesc = RelationGetDescr(partrel);
/*
@@ -3317,7 +3417,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
* partition from the parent's type to the partition's.
*/
(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
- gettext_noop("could not convert row type"));
+ gettext_noop("could not convert row type"));
InitResultRelInfo(leaf_part_rri,
partrel,
@@ -3354,11 +3454,13 @@ ExecSetupPartitionTupleRouting(Relation rel,
* by get_partition_for_tuple() unchanged.
*/
int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
- TupleTableSlot *slot, EState *estate)
+ExecFindPartition(ResultRelInfo *resultRelInfo,
+ PartitionTupleRoutingInfo **ptrinfos,
+ TupleTableSlot *slot,
+ EState *estate)
{
int result;
- PartitionDispatchData *failed_at;
+ PartitionTupleRoutingInfo *failed_at;
TupleTableSlot *failed_slot;
/*
@@ -3368,7 +3470,7 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
if (resultRelInfo->ri_PartitionCheck)
ExecPartitionCheck(resultRelInfo, slot, estate);
- result = get_partition_for_tuple(pd, slot, estate,
+ result = get_partition_for_tuple(ptrinfos, slot, estate,
&failed_at, &failed_slot);
if (result < 0)
{
@@ -3378,7 +3480,7 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
char *val_desc;
ExprContext *ecxt = GetPerTupleExprContext(estate);
- failed_rel = failed_at->reldesc;
+ failed_rel = failed_at->pd->reldesc;
ecxt->ecxt_scantuple = failed_slot;
FormPartitionKeyDatum(failed_at, failed_slot, estate,
key_values, key_isnull);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index e12721a9b6..753ee13985 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -278,7 +278,7 @@ ExecInsert(ModifyTableState *mtstate,
resultRelInfo = estate->es_result_relation_info;
/* Determine the partition to heap_insert the tuple into */
- if (mtstate->mt_partition_dispatch_info)
+ if (mtstate->mt_ptrinfos)
{
int leaf_part_index;
TupleConversionMap *map;
@@ -292,7 +292,7 @@ ExecInsert(ModifyTableState *mtstate,
* respectively.
*/
leaf_part_index = ExecFindPartition(resultRelInfo,
- mtstate->mt_partition_dispatch_info,
+ mtstate->mt_ptrinfos,
slot,
estate);
Assert(leaf_part_index >= 0 &&
@@ -1487,7 +1487,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
int numResultRelInfos;
/* Find the set of partitions so that we can find their TupleDescs. */
- if (mtstate->mt_partition_dispatch_info != NULL)
+ if (mtstate->mt_ptrinfos != NULL)
{
/*
* For INSERT via partitioned table, so we need TupleDescs based
@@ -1911,7 +1911,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
if (operation == CMD_INSERT &&
rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- PartitionDispatch *partition_dispatch_info;
+ PartitionTupleRoutingInfo **ptrinfos;
ResultRelInfo *partitions;
TupleConversionMap **partition_tupconv_maps;
TupleTableSlot *partition_tuple_slot;
@@ -1921,13 +1921,13 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
ExecSetupPartitionTupleRouting(rel,
node->nominalRelation,
estate,
- &partition_dispatch_info,
+ &ptrinfos,
&partitions,
&partition_tupconv_maps,
&partition_tuple_slot,
&num_parted, &num_partitions);
- mtstate->mt_partition_dispatch_info = partition_dispatch_info;
- mtstate->mt_num_dispatch = num_parted;
+ mtstate->mt_ptrinfos = ptrinfos;
+ mtstate->mt_num_parted = num_parted;
mtstate->mt_partitions = partitions;
mtstate->mt_num_partitions = num_partitions;
mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
@@ -2336,21 +2336,24 @@ ExecEndModifyTable(ModifyTableState *node)
resultRelInfo);
}
+ /* Release some resources that we acquired for tuple-routing. */
+
/*
- * Close all the partitioned tables, leaf partitions, and their indices
- *
- * Remember node->mt_partition_dispatch_info[0] corresponds to the root
- * partitioned table, which we must not try to close, because it is the
- * main target table of the query that will be closed by ExecEndPlan().
- * Also, tupslot is NULL for the root partitioned table.
+ * node->mt_ptrinfos[0] corresponds to the root partitioned table, for
+ * which we didn't create tupslot. Also, its relation descriptor will
+ * be closed in ExecEndPlan().
*/
- for (i = 1; i < node->mt_num_dispatch; i++)
+ for (i = 1; i < node->mt_num_parted; i++)
{
- PartitionDispatch pd = node->mt_partition_dispatch_info[i];
+ PartitionTupleRoutingInfo *ptrinfo = node->mt_ptrinfos[i];
- heap_close(pd->reldesc, NoLock);
- ExecDropSingleTupleTableSlot(pd->tupslot);
+ heap_close(ptrinfo->pd->reldesc, NoLock);
+ ExecDropSingleTupleTableSlot(ptrinfo->tupslot);
}
+
+ /*
+ * Close all the leaf partitions and their indices.
+ */
for (i = 0; i < node->mt_num_partitions; i++)
{
ResultRelInfo *resultRelInfo = node->mt_partitions + i;
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 2283c675e9..73cfb4e937 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -40,31 +40,20 @@ typedef struct PartitionDescData
typedef struct PartitionDescData *PartitionDesc;
/*-----------------------
- * PartitionDispatch - information about one partitioned table in a partition
- * hierarchy required to route a tuple to one of its partitions
+ * PartitionDispatchData - information of partitions of one partitioned table
+ * in a partition tree
*
* reldesc Relation descriptor of the table
- * key Partition key information of the table
- * keystate Execution state required for expressions in the partition key
- * partdesc Partition descriptor of the table
- * tupslot A standalone TupleTableSlot initialized with this table's tuple
- * descriptor
- * tupmap TupleConversionMap to convert from the parent's rowtype to
- * this table's rowtype (when extracting the partition key of a
- * tuple just before routing it through this table)
- * indexes Array with partdesc->nparts members (for details on what
- * individual members represent, see how they are set in
+ * parentoid OID of the parent table (InvalidOid if root partitioned table)
+ * indexes Array with reldesc->rd_partdesc->nparts members (for details on
+ * what the individual value represents, see the comments in
* RelationGetPartitionDispatchInfo())
*-----------------------
*/
typedef struct PartitionDispatchData
{
Relation reldesc;
- PartitionKey key;
- List *keystate; /* list of ExprState */
- PartitionDesc partdesc;
- TupleTableSlot *tupslot;
- TupleConversionMap *tupmap;
+ Oid parentoid;
int *indexes;
} PartitionDispatchData;
@@ -86,17 +75,18 @@ extern List *map_partition_varattnos(List *expr, int target_varno,
extern List *RelationGetPartitionQual(Relation rel);
extern Expr *get_partition_qual_relid(Oid relid);
+extern List *RelationGetPartitionDispatchInfo(Relation rel,
+ List **leaf_part_oids);
+
/* For tuple routing */
-extern PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
- int *num_parted, List **leaf_part_oids);
-extern void FormPartitionKeyDatum(PartitionDispatch pd,
+extern void FormPartitionKeyDatum(PartitionTupleRoutingInfo *ptrinfo,
TupleTableSlot *slot,
EState *estate,
Datum *values,
bool *isnull);
-extern int get_partition_for_tuple(PartitionDispatch *pd,
+extern int get_partition_for_tuple(PartitionTupleRoutingInfo **ptrinfos,
TupleTableSlot *slot,
EState *estate,
- PartitionDispatchData **failed_at,
+ PartitionTupleRoutingInfo **failed_at,
TupleTableSlot **failed_slot);
#endif /* PARTITION_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index eacbea3c36..44b7cd0fd6 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -209,13 +209,13 @@ extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti);
extern void ExecSetupPartitionTupleRouting(Relation rel,
Index resultRTindex,
EState *estate,
- PartitionDispatch **pd,
+ PartitionTupleRoutingInfo ***ptrinfos,
ResultRelInfo **partitions,
TupleConversionMap ***tup_conv_maps,
TupleTableSlot **partition_tuple_slot,
int *num_parted, int *num_partitions);
extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
- PartitionDispatch *pd,
+ PartitionTupleRoutingInfo **ptrinfos,
TupleTableSlot *slot,
EState *estate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3272c4b315..ab28169a96 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -414,6 +414,42 @@ typedef struct ResultRelInfo
Relation ri_PartitionRoot;
} ResultRelInfo;
+/* Forward declarations, to avoid including other headers */
+typedef struct PartitionDispatchData *PartitionDispatch;
+
+/*
+ * PartitionTupleRoutingInfo - information required for tuple-routing
+ * through one partitioned table in a partition
+ * tree
+ */
+typedef struct PartitionTupleRoutingInfo
+{
+
+ /* Information about the table's partitions */
+ PartitionDispatch pd;
+
+ /*
+ * The execution state required for expressions contained in the partition
+ * key. It is NIL until initialized by FormPartitionKeyDatum() if and when
+ * it is called; for example, the first time a tuple is routed through this
+ * table.
+ */
+ List *keystate;
+
+ /*
+ * A standalone TupleTableSlot initialized with this table's tuple
+ * descriptor
+ */
+ TupleTableSlot *tupslot;
+
+ /*
+ * TupleConversionMap to convert from the parent's rowtype to this table's
+ * rowtype (when extracting the partition key of a tuple just before
+ * routing it through this table)
+ */
+ TupleConversionMap *tupmap;
+} PartitionTupleRoutingInfo;
+
/* ----------------
* EState information
*
@@ -973,9 +1009,9 @@ typedef struct ModifyTableState
TupleTableSlot *mt_existing; /* slot to store existing target tuple in */
List *mt_excludedtlist; /* the excluded pseudo relation's tlist */
TupleTableSlot *mt_conflproj; /* CONFLICT ... SET ... projection target */
- struct PartitionDispatchData **mt_partition_dispatch_info;
/* Tuple-routing support info */
- int mt_num_dispatch; /* Number of entries in the above array */
+ struct PartitionTupleRoutingInfo **mt_ptrinfos;
+ int mt_num_parted; /* Number of entries in the above array */
int mt_num_partitions; /* Number of members in the following
* arrays */
ResultRelInfo *mt_partitions; /* Per partition result relation */
--
2.11.0
0002-Teach-expand_inherited_rtentry-to-use-partition-boun.patch (text/plain; charset=UTF-8)
From b23ce2d7acbb89636c61b5e10c93211b052ef61a Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Wed, 9 Aug 2017 15:52:36 +0900
Subject: [PATCH 2/2] Teach expand_inherited_rtentry to use partition bound
order
After locking the child tables using find_all_inheritors, we discard
the list of child table OIDs that it generates and rebuild the same
using the information returned by RelationGetPartitionDispatchInfo.
---
src/backend/optimizer/prep/prepunion.c | 35 ++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index e73c819901..1ae1a851d4 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -33,6 +33,7 @@
#include "access/heapam.h"
#include "access/htup_details.h"
#include "access/sysattr.h"
+#include "catalog/partition.h"
#include "catalog/pg_inherits_fn.h"
#include "catalog/pg_type.h"
#include "miscadmin.h"
@@ -1452,6 +1453,40 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
*/
oldrelation = heap_open(parentOID, NoLock);
+ /*
+ * For partitioned tables, we arrange the child table OIDs such that they
+ * appear in the partition bound order.
+ */
+ if (rte->relkind == RELKIND_PARTITIONED_TABLE)
+ {
+ List *leaf_part_oids,
+ *pdlist;
+
+ /* Discard the original list. */
+ list_free(inhOIDs);
+ inhOIDs = NIL;
+
+ /* Request partitioning information. */
+ pdlist = RelationGetPartitionDispatchInfo(oldrelation,
+ &leaf_part_oids);
+
+ /*
+ * First collect the partitioned child table OIDs, which includes the
+ * root parent at the head.
+ */
+ foreach(l, pdlist)
+ {
+ PartitionDispatch pd = lfirst(l);
+
+ inhOIDs = lappend_oid(inhOIDs, RelationGetRelid(pd->reldesc));
+ if (pd->reldesc != oldrelation)
+ heap_close(pd->reldesc, NoLock);
+ }
+
+ /* Concatenate the leaf partition OIDs. */
+ inhOIDs = list_concat(inhOIDs, leaf_part_oids);
+ }
+
/* Scan the inheritance set and expand it */
appinfos = NIL;
has_child = false;
--
2.11.0
On Mon, Aug 28, 2017 at 6:38 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I am worried about the open relcache reference in PartitionDispatch when
> we start using it in the planner. Whereas there is an ExecEndModifyTable()
> as a suitable place to close that reference, there doesn't seem to exist
> one within the planner, but I guess we will have to figure something out.
Yes, I think there's no real way to avoid having to figure that out.
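For what it's worth, the 0002 patch as posted deals with this in expand_inherited_rtentry() by closing every pd->reldesc except the root's right after copying out its OID. Below is a minimal sketch of that open/close pairing, assuming a toy reference count in place of the relcache. ToyRel, toy_open, toy_close and toy_collect_oids are hypothetical names for illustration only, not PostgreSQL APIs.

```c
#include <assert.h>

/* Toy stand-in for a relcache entry: an OID plus a reference count. */
typedef struct ToyRel
{
	unsigned int oid;
	int			refcount;
} ToyRel;

static void
toy_open(ToyRel *rel)
{
	rel->refcount++;
}

static void
toy_close(ToyRel *rel)
{
	assert(rel->refcount > 0);
	rel->refcount--;
}

/*
 * Copy the OIDs of rels[0..n-1] into oids[].  rels[0] is the root, whose
 * reference belongs to the caller; every other reference is released here
 * as soon as its OID has been copied out.
 */
static int
toy_collect_oids(ToyRel **rels, int n, unsigned int *oids)
{
	int			i;

	for (i = 0; i < n; i++)
	{
		oids[i] = rels[i]->oid;
		if (i != 0)
			toy_close(rels[i]);
	}
	return n;
}

/* Example: returns the number of references still held at the end. */
static int
toy_example_leaked_refs(void)
{
	ToyRel		root = {100, 0};
	ToyRel		child = {101, 0};
	ToyRel	   *rels[] = {&root, &child};
	unsigned int oids[2];

	toy_open(&root);			/* the caller's own reference */
	toy_open(&child);			/* taken while building the dispatch info */
	toy_collect_oids(rels, 2, oids);
	toy_close(&root);			/* the caller releases the root itself */
	return root.refcount + child.refcount;
}
```

The point of the sketch is only that whoever consumes the dispatch array has to release the non-root references itself, since the planner has no later cleanup hook comparable to ExecEndModifyTable().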
> OK, done this way in the attached updated patch. Any suggestions about a
> better name for what the patch calls PartitionTupleRoutingInfo?
I think this patch could be further simplified by continuing to use
the existing function signature for RelationGetPartitionDispatchInfo
instead of changing it to return a List * rather than an array. I
don't see any benefit to such a change. The current system is more
efficient.
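To illustrate why the array form is convenient: get_partition_for_tuple() moves to a sub-partitioned table with a plain subscript, parent = pd[-parent->indexes[cur_index]]. The following is a rough sketch of that descent, assuming integer range bounds and toy types. ToyDispatch and toy_route are made-up names, and the real code also handles list partitioning, NULLs and tuple conversion, none of which is modelled here.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for PartitionDispatchData: only bounds and indexes. */
typedef struct ToyDispatch
{
	int			nparts;		/* number of partitions of this table */
	const int  *upper;		/* exclusive upper bound of each partition,
							 * in ascending (bound) order */
	const int  *indexes;	/* >= 0: leaf partition index; < 0: negated
							 * position of a partitioned table in pd[] */
} ToyDispatch;

/*
 * Route "key" starting from pd[0], the root.  Returns the leaf partition
 * index, or -1 if no partition accepts the key.
 */
static int
toy_route(const ToyDispatch *const *pd, int key)
{
	const ToyDispatch *parent = pd[0];

	for (;;)
	{
		int			lo = 0;
		int			hi = parent->nparts;

		/* Binary search: first partition whose upper bound exceeds key. */
		while (lo < hi)
		{
			int			mid = lo + (hi - lo) / 2;

			if (key < parent->upper[mid])
				hi = mid;
			else
				lo = mid + 1;
		}

		if (lo == parent->nparts)
			return -1;			/* key is above every bound */
		if (parent->indexes[lo] >= 0)
			return parent->indexes[lo];	/* reached a leaf partition */

		/* Negative entry: jump to that partitioned table by subscript. */
		parent = pd[-parent->indexes[lo]];
	}
}

/* Example tree: root covers (.., 30); its middle partition [10, 20) is
 * itself partitioned into [10, 15) and [15, 20). */
static const int toy_root_upper[] = {10, 20, 30};
static const int toy_root_indexes[] = {0, -1, 3};
static const int toy_sub_upper[] = {15, 20};
static const int toy_sub_indexes[] = {1, 2};
static const ToyDispatch toy_root = {3, toy_root_upper, toy_root_indexes};
static const ToyDispatch toy_sub = {2, toy_sub_upper, toy_sub_indexes};
static const ToyDispatch *const toy_pd[] = {&toy_root, &toy_sub};
```

With this layout, toy_route(toy_pd, 12) first lands on the root's second partition, follows the -1 entry to toy_pd[1], and ends at leaf 1. A list-based return value would need a positional lookup for the same jump.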
I keep having the feeling that this is a big patch with a small patch
struggling to get out. Is it really necessary to change
RelationGetPartitionDispatchInfo so much or could you just do a really
minimal surgery to remove the code that sets the stuff we don't need?
Like this:
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 96a64ce6b2..4fabcf9f32 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1089,29 +1089,7 @@ RelationGetPartitionDispatchInfo(Relation rel,
pd[i] = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
pd[i]->reldesc = partrel;
pd[i]->key = partkey;
- pd[i]->keystate = NIL;
pd[i]->partdesc = partdesc;
- if (parent != NULL)
- {
- /*
- * For every partitioned table other than root, we must store a
- * tuple table slot initialized with its tuple descriptor and a
- * tuple conversion map to convert a tuple from its parent's
- * rowtype to its own. That is to make sure that we are looking at
- * the correct row using the correct tuple descriptor when
- * computing its partition key for tuple routing.
- */
- pd[i]->tupslot = MakeSingleTupleTableSlot(tupdesc);
- pd[i]->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
- tupdesc,
- gettext_noop("could not convert row type"));
- }
- else
- {
- /* Not required for the root partitioned table */
- pd[i]->tupslot = NULL;
- pd[i]->tupmap = NULL;
- }
pd[i]->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
/*
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017/08/29 4:26, Robert Haas wrote:
> I think this patch could be further simplified by continuing to use
> the existing function signature for RelationGetPartitionDispatchInfo
> instead of changing it to return a List * rather than an array. I
> don't see any benefit to such a change. The current system is more
> efficient.
OK, restored the array way.
> I keep having the feeling that this is a big patch with a small patch
> struggling to get out. Is it really necessary to change
> RelationGetPartitionDispatchInfo so much or could you just do a really
> minimal surgery to remove the code that sets the stuff we don't need?
> Like this:
Sure, done in the attached updated patch.
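In outline, the updated 0001 leaves PartitionDispatchData holding only catalog-derived information (plus a new parentoid), and has ExecSetupPartitionTupleRouting() wrap each entry in a PartitionTupleRoutingInfo that owns the executor-side pieces. Here is a simplified sketch of that split, assuming toy types. ToyDispatch, ToyRoutingInfo and the has_tupslot flag are placeholders for illustration and do not appear in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef unsigned int ToyOid;
#define TOY_INVALID_OID 0u

/* Catalog side: no executor resources (compare PartitionDispatchData). */
typedef struct ToyDispatch
{
	ToyOid		relid;
	ToyOid		parentoid;	/* TOY_INVALID_OID for the root */
} ToyDispatch;

/* Executor side: wraps one dispatch entry (compare
 * PartitionTupleRoutingInfo). */
typedef struct ToyRoutingInfo
{
	const ToyDispatch *pd;
	int			has_tupslot;	/* stands in for tupslot and tupmap */
} ToyRoutingInfo;

/*
 * Build one wrapper per partitioned table.  Only non-root tables get the
 * slot, mirroring the rule that the root needs no tuple conversion.
 * Returns NULL on allocation failure; the caller frees the result.
 */
static ToyRoutingInfo *
toy_setup_routing(const ToyDispatch *pds, int num_parted)
{
	ToyRoutingInfo *infos;
	int			i;

	infos = calloc((size_t) num_parted, sizeof(ToyRoutingInfo));
	if (infos == NULL)
		return NULL;

	for (i = 0; i < num_parted; i++)
	{
		infos[i].pd = &pds[i];
		infos[i].has_tupslot = (pds[i].parentoid != TOY_INVALID_OID);
	}
	return infos;
}

/* Count how many wrappers carry a slot; -1 on allocation failure. */
static int
toy_count_slots(const ToyDispatch *pds, int num_parted)
{
	ToyRoutingInfo *infos = toy_setup_routing(pds, num_parted);
	int			i;
	int			count = 0;

	if (infos == NULL)
		return -1;
	for (i = 0; i < num_parted; i++)
		count += infos[i].has_tupslot;
	free(infos);
	return count;
}

/* Example tree: root 100, child 101 under it, grandchild 102 under 101. */
static const ToyDispatch toy_tree[] = {
	{100, TOY_INVALID_OID},
	{101, 100},
	{102, 101}
};
```

Because the catalog-side struct no longer owns slots or conversion maps, a non-executor caller can consume the dispatch array without having to release any executor resources.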
Thanks,
Amit
Attachments:
0001-Decouple-RelationGetPartitionDispatchInfo-from-execu.patch (text/plain; charset=UTF-8)
From 9dd8e6f6bd3636f8c125c71e6d1c65bf606a2a22 Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Wed, 30 Aug 2017 10:02:05 +0900
Subject: [PATCH 1/2] Decouple RelationGetPartitionDispatchInfo() from executor
Currently it and the structure it generates viz. PartitionDispatch
objects are too coupled with the executor's tuple-routing code. In
particular, it's pretty undesirable that it makes it the responsibility
of the caller to release some resources, such as executor tuple table
slots, tuple-conversion maps, etc. That makes it harder to use in
places other than where it's currently being used.
After this refactoring, ExecSetupPartitionTupleRouting() now needs to
do some of the work that was previously done in
RelationGetPartitionDispatchInfo().
---
src/backend/catalog/partition.c | 53 +++++++-------------
src/backend/commands/copy.c | 32 +++++++------
src/backend/executor/execMain.c | 88 ++++++++++++++++++++++++++++++----
src/backend/executor/nodeModifyTable.c | 37 +++++++-------
src/include/catalog/partition.h | 20 +++-----
src/include/executor/executor.h | 4 +-
src/include/nodes/execnodes.h | 40 +++++++++++++++-
7 files changed, 181 insertions(+), 93 deletions(-)
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 96a64ce6b2..c92756ecd5 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1081,7 +1081,6 @@ RelationGetPartitionDispatchInfo(Relation rel,
Relation partrel = lfirst(lc1);
Relation parent = lfirst(lc2);
PartitionKey partkey = RelationGetPartitionKey(partrel);
- TupleDesc tupdesc = RelationGetDescr(partrel);
PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
int j,
m;
@@ -1089,29 +1088,12 @@ RelationGetPartitionDispatchInfo(Relation rel,
pd[i] = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
pd[i]->reldesc = partrel;
pd[i]->key = partkey;
- pd[i]->keystate = NIL;
pd[i]->partdesc = partdesc;
if (parent != NULL)
- {
- /*
- * For every partitioned table other than root, we must store a
- * tuple table slot initialized with its tuple descriptor and a
- * tuple conversion map to convert a tuple from its parent's
- * rowtype to its own. That is to make sure that we are looking at
- * the correct row using the correct tuple descriptor when
- * computing its partition key for tuple routing.
- */
- pd[i]->tupslot = MakeSingleTupleTableSlot(tupdesc);
- pd[i]->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
- tupdesc,
- gettext_noop("could not convert row type"));
- }
+ pd[i]->parentoid = RelationGetRelid(parent);
else
- {
- /* Not required for the root partitioned table */
- pd[i]->tupslot = NULL;
- pd[i]->tupmap = NULL;
- }
+ pd[i]->parentoid = InvalidOid;
+
pd[i]->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
/*
@@ -1860,7 +1842,7 @@ generate_partition_qual(Relation rel)
* Construct values[] and isnull[] arrays for the partition key
* of a tuple.
*
- * pd Partition dispatch object of the partitioned table
+ * ptrinfo PartitionTupleRoutingInfo object of the table
* slot Heap tuple from which to extract partition key
* estate executor state for evaluating any partition key
* expressions (must be non-NULL)
@@ -1872,26 +1854,27 @@ generate_partition_qual(Relation rel)
* ----------------
*/
void
-FormPartitionKeyDatum(PartitionDispatch pd,
+FormPartitionKeyDatum(PartitionTupleRoutingInfo *ptrinfo,
TupleTableSlot *slot,
EState *estate,
Datum *values,
bool *isnull)
{
+ PartitionDispatch pd = ptrinfo->pd;
ListCell *partexpr_item;
int i;
- if (pd->key->partexprs != NIL && pd->keystate == NIL)
+ if (pd->key->partexprs != NIL && ptrinfo->keystate == NIL)
{
/* Check caller has set up context correctly */
Assert(estate != NULL &&
GetPerTupleExprContext(estate)->ecxt_scantuple == slot);
/* First time through, set up expression evaluation state */
- pd->keystate = ExecPrepareExprList(pd->key->partexprs, estate);
+ ptrinfo->keystate = ExecPrepareExprList(pd->key->partexprs, estate);
}
- partexpr_item = list_head(pd->keystate);
+ partexpr_item = list_head(ptrinfo->keystate);
for (i = 0; i < pd->key->partnatts; i++)
{
AttrNumber keycol = pd->key->partattrs[i];
@@ -1931,13 +1914,13 @@ FormPartitionKeyDatum(PartitionDispatch pd,
* the latter case.
*/
int
-get_partition_for_tuple(PartitionDispatch *pd,
+get_partition_for_tuple(PartitionTupleRoutingInfo **ptrinfos,
TupleTableSlot *slot,
EState *estate,
- PartitionDispatchData **failed_at,
+ PartitionTupleRoutingInfo **failed_at,
TupleTableSlot **failed_slot)
{
- PartitionDispatch parent;
+ PartitionTupleRoutingInfo *parent;
Datum values[PARTITION_MAX_KEYS];
bool isnull[PARTITION_MAX_KEYS];
int cur_offset,
@@ -1948,11 +1931,11 @@ get_partition_for_tuple(PartitionDispatch *pd,
TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
/* start with the root partitioned table */
- parent = pd[0];
+ parent = ptrinfos[0];
while (true)
{
- PartitionKey key = parent->key;
- PartitionDesc partdesc = parent->partdesc;
+ PartitionKey key = parent->pd->key;
+ PartitionDesc partdesc = parent->pd->partdesc;
TupleTableSlot *myslot = parent->tupslot;
TupleConversionMap *map = parent->tupmap;
@@ -2055,13 +2038,13 @@ get_partition_for_tuple(PartitionDispatch *pd,
*failed_slot = slot;
break;
}
- else if (parent->indexes[cur_index] >= 0)
+ else if (parent->pd->indexes[cur_index] >= 0)
{
- result = parent->indexes[cur_index];
+ result = parent->pd->indexes[cur_index];
break;
}
else
- parent = pd[-parent->indexes[cur_index]];
+ parent = ptrinfos[-parent->pd->indexes[cur_index]];
}
error_exit:
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfa3f059c2..288d6a1ab2 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -165,8 +165,8 @@ typedef struct CopyStateData
bool volatile_defexprs; /* is any of defexprs volatile? */
List *range_table;
- PartitionDispatch *partition_dispatch_info;
- int num_dispatch; /* Number of entries in the above array */
+ PartitionTupleRoutingInfo **ptrinfos;
+ int num_parted; /* Number of entries in the above array */
int num_partitions; /* Number of members in the following arrays */
ResultRelInfo *partitions; /* Per partition result relation */
TupleConversionMap **partition_tupconv_maps;
@@ -2445,7 +2445,7 @@ CopyFrom(CopyState cstate)
*/
if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- PartitionDispatch *partition_dispatch_info;
+ PartitionTupleRoutingInfo **ptrinfos;
ResultRelInfo *partitions;
TupleConversionMap **partition_tupconv_maps;
TupleTableSlot *partition_tuple_slot;
@@ -2455,13 +2455,13 @@ CopyFrom(CopyState cstate)
ExecSetupPartitionTupleRouting(cstate->rel,
1,
estate,
- &partition_dispatch_info,
+ &ptrinfos,
&partitions,
&partition_tupconv_maps,
&partition_tuple_slot,
&num_parted, &num_partitions);
- cstate->partition_dispatch_info = partition_dispatch_info;
- cstate->num_dispatch = num_parted;
+ cstate->ptrinfos = ptrinfos;
+ cstate->num_parted = num_parted;
cstate->partitions = partitions;
cstate->num_partitions = num_partitions;
cstate->partition_tupconv_maps = partition_tupconv_maps;
@@ -2502,7 +2502,7 @@ CopyFrom(CopyState cstate)
if ((resultRelInfo->ri_TrigDesc != NULL &&
(resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
- cstate->partition_dispatch_info != NULL ||
+ cstate->ptrinfos != NULL ||
cstate->volatile_defexprs)
{
useHeapMultiInsert = false;
@@ -2580,7 +2580,7 @@ CopyFrom(CopyState cstate)
ExecStoreTuple(tuple, slot, InvalidBuffer, false);
/* Determine the partition to heap_insert the tuple into */
- if (cstate->partition_dispatch_info)
+ if (cstate->ptrinfos)
{
int leaf_part_index;
TupleConversionMap *map;
@@ -2594,7 +2594,7 @@ CopyFrom(CopyState cstate)
* partition, respectively.
*/
leaf_part_index = ExecFindPartition(resultRelInfo,
- cstate->partition_dispatch_info,
+ cstate->ptrinfos,
slot,
estate);
Assert(leaf_part_index >= 0 &&
@@ -2826,8 +2826,8 @@ CopyFrom(CopyState cstate)
ExecCloseIndices(resultRelInfo);
- /* Close all the partitioned tables, leaf partitions, and their indices */
- if (cstate->partition_dispatch_info)
+ /* Release some resources that we acquired for tuple-routing. */
+ if (cstate->ptrinfos)
{
int i;
@@ -2837,13 +2837,15 @@ CopyFrom(CopyState cstate)
* the main target table of COPY that will be closed eventually by
* DoCopy(). Also, tupslot is NULL for the root partitioned table.
*/
- for (i = 1; i < cstate->num_dispatch; i++)
+ for (i = 1; i < cstate->num_parted; i++)
{
- PartitionDispatch pd = cstate->partition_dispatch_info[i];
+ PartitionTupleRoutingInfo *ptrinfo = cstate->ptrinfos[i];
- heap_close(pd->reldesc, NoLock);
- ExecDropSingleTupleTableSlot(pd->tupslot);
+ heap_close(ptrinfo->pd->reldesc, NoLock);
+ ExecDropSingleTupleTableSlot(ptrinfo->tupslot);
}
+
+ /* Close all the leaf partitions and their indices */
for (i = 0; i < cstate->num_partitions; i++)
{
ResultRelInfo *resultRelInfo = cstate->partitions + i;
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 2946a0edee..23ed2c55b9 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -3236,8 +3236,8 @@ EvalPlanQualEnd(EPQState *epqstate)
* tuple routing for partitioned tables
*
* Output arguments:
- * 'pd' receives an array of PartitionDispatch objects with one entry for
- * every partitioned table in the partition tree
+ * 'ptrinfos' receives an array of PartitionTupleRoutingInfo objects with one
+ * entry for each partitioned table in the partition tree
* 'partitions' receives an array of ResultRelInfo objects with one entry for
* every leaf partition in the partition tree
* 'tup_conv_maps' receives an array of TupleConversionMap objects with one
@@ -3260,7 +3260,7 @@ void
ExecSetupPartitionTupleRouting(Relation rel,
Index resultRTindex,
EState *estate,
- PartitionDispatch **pd,
+ PartitionTupleRoutingInfo ***ptrinfos,
ResultRelInfo **partitions,
TupleConversionMap ***tup_conv_maps,
TupleTableSlot **partition_tuple_slot,
@@ -3268,16 +3268,84 @@ ExecSetupPartitionTupleRouting(Relation rel,
{
TupleDesc tupDesc = RelationGetDescr(rel);
List *leaf_parts;
+ PartitionDispatch *pds;
ListCell *cell;
int i;
ResultRelInfo *leaf_part_rri;
+ Relation parent;
/*
* Get the information about the partition tree after locking all the
* partitions.
*/
(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
- *pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
+ pds = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
+
+ /*
+ * Construct PartitionTupleRoutingInfo objects, one for each partitioned
+ * table in the tree, using its PartitionDispatch in the pds array.
+ */
+ *ptrinfos = (PartitionTupleRoutingInfo **) palloc0(*num_parted *
+ sizeof(PartitionTupleRoutingInfo *));
+ parent = NULL;
+ for (i = 0; i < *num_parted; i++)
+ {
+ PartitionTupleRoutingInfo *ptrinfo;
+
+ ptrinfo = (PartitionTupleRoutingInfo *)
+ palloc0(sizeof(PartitionTupleRoutingInfo));
+ /* Stash a reference to this PartitionDispatch. */
+ ptrinfo->pd = pds[i];
+
+ /* State for extracting partition key from tuples will go here. */
+ ptrinfo->keystate = NIL;
+
+ /*
+ * For every partitioned table other than root, we must store a tuple
+ * table slot initialized with its tuple descriptor and a tuple
+ * conversion map to convert a tuple from its parent's rowtype to its
+ * own. That is to make sure that we are looking at the correct row
+ * using the correct tuple descriptor when computing its partition key
+ * for tuple routing.
+ */
+ if (pds[i]->parentoid != InvalidOid)
+ {
+ TupleDesc tupdesc = RelationGetDescr(pds[i]->reldesc);
+
+ /* Open the parent relation descriptor if not already done. */
+ if (pds[i]->parentoid == RelationGetRelid(rel))
+ parent = rel;
+ else if (parent == NULL)
+ /* Locked by RelationGetPartitionDispatchInfo(). */
+ parent = heap_open(pds[i]->parentoid, NoLock);
+
+ ptrinfo->tupslot = MakeSingleTupleTableSlot(tupdesc);
+ ptrinfo->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
+ tupdesc,
+ gettext_noop("could not convert row type"));
+ /*
+ * Close the parent descriptor, if the next partitioned table in
+ * the list is not a sibling, because it will have a different
+ * parent if so.
+ */
+ if (parent != NULL && parent != rel && i + 1 < *num_parted &&
+ pds[i + 1]->parentoid != pds[i]->parentoid)
+ {
+ heap_close(parent, NoLock);
+ parent = NULL;
+ }
+ }
+ else
+ {
+ /* Not required for the root partitioned table */
+ ptrinfo->tupslot = NULL;
+ ptrinfo->tupmap = NULL;
+ }
+
+ (*ptrinfos)[i] = ptrinfo;
+ }
+
+ /* For leaf partitions, we build ResultRelInfos and TupleConversionMaps. */
*num_partitions = list_length(leaf_parts);
*partitions = (ResultRelInfo *) palloc(*num_partitions *
sizeof(ResultRelInfo));
@@ -3354,11 +3422,13 @@ ExecSetupPartitionTupleRouting(Relation rel,
* by get_partition_for_tuple() unchanged.
*/
int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
- TupleTableSlot *slot, EState *estate)
+ExecFindPartition(ResultRelInfo *resultRelInfo,
+ PartitionTupleRoutingInfo **ptrinfos,
+ TupleTableSlot *slot,
+ EState *estate)
{
int result;
- PartitionDispatchData *failed_at;
+ PartitionTupleRoutingInfo *failed_at;
TupleTableSlot *failed_slot;
/*
@@ -3368,7 +3438,7 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
if (resultRelInfo->ri_PartitionCheck)
ExecPartitionCheck(resultRelInfo, slot, estate);
- result = get_partition_for_tuple(pd, slot, estate,
+ result = get_partition_for_tuple(ptrinfos, slot, estate,
&failed_at, &failed_slot);
if (result < 0)
{
@@ -3378,7 +3448,7 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
char *val_desc;
ExprContext *ecxt = GetPerTupleExprContext(estate);
- failed_rel = failed_at->reldesc;
+ failed_rel = failed_at->pd->reldesc;
ecxt->ecxt_scantuple = failed_slot;
FormPartitionKeyDatum(failed_at, failed_slot, estate,
key_values, key_isnull);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index e12721a9b6..753ee13985 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -278,7 +278,7 @@ ExecInsert(ModifyTableState *mtstate,
resultRelInfo = estate->es_result_relation_info;
/* Determine the partition to heap_insert the tuple into */
- if (mtstate->mt_partition_dispatch_info)
+ if (mtstate->mt_ptrinfos)
{
int leaf_part_index;
TupleConversionMap *map;
@@ -292,7 +292,7 @@ ExecInsert(ModifyTableState *mtstate,
* respectively.
*/
leaf_part_index = ExecFindPartition(resultRelInfo,
- mtstate->mt_partition_dispatch_info,
+ mtstate->mt_ptrinfos,
slot,
estate);
Assert(leaf_part_index >= 0 &&
@@ -1487,7 +1487,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
int numResultRelInfos;
/* Find the set of partitions so that we can find their TupleDescs. */
- if (mtstate->mt_partition_dispatch_info != NULL)
+ if (mtstate->mt_ptrinfos != NULL)
{
/*
* For INSERT via partitioned table, so we need TupleDescs based
@@ -1911,7 +1911,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
if (operation == CMD_INSERT &&
rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- PartitionDispatch *partition_dispatch_info;
+ PartitionTupleRoutingInfo **ptrinfos;
ResultRelInfo *partitions;
TupleConversionMap **partition_tupconv_maps;
TupleTableSlot *partition_tuple_slot;
@@ -1921,13 +1921,13 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
ExecSetupPartitionTupleRouting(rel,
node->nominalRelation,
estate,
- &partition_dispatch_info,
+ &ptrinfos,
&partitions,
&partition_tupconv_maps,
&partition_tuple_slot,
&num_parted, &num_partitions);
- mtstate->mt_partition_dispatch_info = partition_dispatch_info;
- mtstate->mt_num_dispatch = num_parted;
+ mtstate->mt_ptrinfos = ptrinfos;
+ mtstate->mt_num_parted = num_parted;
mtstate->mt_partitions = partitions;
mtstate->mt_num_partitions = num_partitions;
mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
@@ -2336,21 +2336,24 @@ ExecEndModifyTable(ModifyTableState *node)
resultRelInfo);
}
+ /* Release some resources that we acquired for tuple-routing. */
+
/*
- * Close all the partitioned tables, leaf partitions, and their indices
- *
- * Remember node->mt_partition_dispatch_info[0] corresponds to the root
- * partitioned table, which we must not try to close, because it is the
- * main target table of the query that will be closed by ExecEndPlan().
- * Also, tupslot is NULL for the root partitioned table.
+ * node->mt_ptrinfos[0] corresponds to the root partitioned table, for
+ * which we didn't create tupslot. Also, its relation descriptor will
+ * be closed in ExecEndPlan().
*/
- for (i = 1; i < node->mt_num_dispatch; i++)
+ for (i = 1; i < node->mt_num_parted; i++)
{
- PartitionDispatch pd = node->mt_partition_dispatch_info[i];
+ PartitionTupleRoutingInfo *ptrinfo = node->mt_ptrinfos[i];
- heap_close(pd->reldesc, NoLock);
- ExecDropSingleTupleTableSlot(pd->tupslot);
+ heap_close(ptrinfo->pd->reldesc, NoLock);
+ ExecDropSingleTupleTableSlot(ptrinfo->tupslot);
}
+
+ /*
+ * Close all the leaf partitions and their indices.
+ */
for (i = 0; i < node->mt_num_partitions; i++)
{
ResultRelInfo *resultRelInfo = node->mt_partitions + i;
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 2283c675e9..1091dd572c 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -45,13 +45,8 @@ typedef struct PartitionDescData *PartitionDesc;
*
* reldesc Relation descriptor of the table
* key Partition key information of the table
- * keystate Execution state required for expressions in the partition key
* partdesc Partition descriptor of the table
- * tupslot A standalone TupleTableSlot initialized with this table's tuple
- * descriptor
- * tupmap TupleConversionMap to convert from the parent's rowtype to
- * this table's rowtype (when extracting the partition key of a
- * tuple just before routing it through this table)
+ * parentoid OID of the parent table (InvalidOid if root partitioned table)
* indexes Array with partdesc->nparts members (for details on what
* individual members represent, see how they are set in
* RelationGetPartitionDispatchInfo())
@@ -61,10 +56,8 @@ typedef struct PartitionDispatchData
{
Relation reldesc;
PartitionKey key;
- List *keystate; /* list of ExprState */
PartitionDesc partdesc;
- TupleTableSlot *tupslot;
- TupleConversionMap *tupmap;
+ Oid parentoid;
int *indexes;
} PartitionDispatchData;
@@ -86,17 +79,18 @@ extern List *map_partition_varattnos(List *expr, int target_varno,
extern List *RelationGetPartitionQual(Relation rel);
extern Expr *get_partition_qual_relid(Oid relid);
-/* For tuple routing */
extern PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
int *num_parted, List **leaf_part_oids);
-extern void FormPartitionKeyDatum(PartitionDispatch pd,
+
+/* For tuple routing */
+extern void FormPartitionKeyDatum(PartitionTupleRoutingInfo *ptrinfo,
TupleTableSlot *slot,
EState *estate,
Datum *values,
bool *isnull);
-extern int get_partition_for_tuple(PartitionDispatch *pd,
+extern int get_partition_for_tuple(PartitionTupleRoutingInfo **ptrinfos,
TupleTableSlot *slot,
EState *estate,
- PartitionDispatchData **failed_at,
+ PartitionTupleRoutingInfo **failed_at,
TupleTableSlot **failed_slot);
#endif /* PARTITION_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index f48a603dae..04422e1a6f 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -209,13 +209,13 @@ extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti);
extern void ExecSetupPartitionTupleRouting(Relation rel,
Index resultRTindex,
EState *estate,
- PartitionDispatch **pd,
+ PartitionTupleRoutingInfo ***ptrinfos,
ResultRelInfo **partitions,
TupleConversionMap ***tup_conv_maps,
TupleTableSlot **partition_tuple_slot,
int *num_parted, int *num_partitions);
extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
- PartitionDispatch *pd,
+ PartitionTupleRoutingInfo **ptrinfos,
TupleTableSlot *slot,
EState *estate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index d1565e7496..c0925c5f76 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -414,6 +414,42 @@ typedef struct ResultRelInfo
Relation ri_PartitionRoot;
} ResultRelInfo;
+/* Forward declarations, to avoid including other headers */
+typedef struct PartitionDispatchData *PartitionDispatch;
+
+/*
+ * PartitionTupleRoutingInfo - information required for tuple-routing
+ * through one partitioned table in a partition
+ * tree
+ */
+typedef struct PartitionTupleRoutingInfo
+{
+
+ /* Information about the table's partitions */
+ PartitionDispatch pd;
+
+ /*
+ * The execution state required for expressions contained in the partition
+ * key. It is NIL until initialized by FormPartitionKeyDatum() if and when
+ * it is called; for example, the first time a tuple is routed through this
+ * table.
+ */
+ List *keystate;
+
+ /*
+ * A standalone TupleTableSlot initialized with this table's tuple
+ * descriptor
+ */
+ TupleTableSlot *tupslot;
+
+ /*
+ * TupleConversionMap to convert from the parent's rowtype to this table's
+ * rowtype (when extracting the partition key of a tuple just before
+ * routing it through this table)
+ */
+ TupleConversionMap *tupmap;
+} PartitionTupleRoutingInfo;
+
/* ----------------
* EState information
*
@@ -973,9 +1009,9 @@ typedef struct ModifyTableState
TupleTableSlot *mt_existing; /* slot to store existing target tuple in */
List *mt_excludedtlist; /* the excluded pseudo relation's tlist */
TupleTableSlot *mt_conflproj; /* CONFLICT ... SET ... projection target */
- struct PartitionDispatchData **mt_partition_dispatch_info;
/* Tuple-routing support info */
- int mt_num_dispatch; /* Number of entries in the above array */
+ struct PartitionTupleRoutingInfo **mt_ptrinfos;
+ int mt_num_parted; /* Number of entries in the above array */
int mt_num_partitions; /* Number of members in the following
* arrays */
ResultRelInfo *mt_partitions; /* Per partition result relation */
--
2.11.0
0002-Teach-expand_inherited_rtentry-to-use-partition-boun.patch (text/plain; charset=UTF-8)
From 2d77335d834e5a20fb0f07c13d647f74e0f39082 Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Wed, 9 Aug 2017 15:52:36 +0900
Subject: [PATCH 2/2] Teach expand_inherited_rtentry to use partition bound
order
After locking the child tables using find_all_inheritors, we discard
the list of child table OIDs that it generates and rebuild the same
using the information returned by RelationGetPartitionDispatchInfo.
---
src/backend/optimizer/prep/prepunion.c | 35 ++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index e73c819901..2202ad9941 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -33,6 +33,7 @@
#include "access/heapam.h"
#include "access/htup_details.h"
#include "access/sysattr.h"
+#include "catalog/partition.h"
#include "catalog/pg_inherits_fn.h"
#include "catalog/pg_type.h"
#include "miscadmin.h"
@@ -1452,6 +1453,40 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
*/
oldrelation = heap_open(parentOID, NoLock);
+ /*
+ * For partitioned tables, we arrange the child table OIDs such that they
+ * appear in the partition bound order.
+ */
+ if (rte->relkind == RELKIND_PARTITIONED_TABLE)
+ {
+ List *leaf_part_oids;
+ int num_parted,
+ i;
+ PartitionDispatch *pds;
+
+ /* Discard the original list. */
+ list_free(inhOIDs);
+ inhOIDs = NIL;
+
+ /* Request partitioning information. */
+ pds = RelationGetPartitionDispatchInfo(oldrelation, &num_parted,
+ &leaf_part_oids);
+
+ /*
+ * First collect the partitioned child table OIDs, which includes the
+ * root parent at the head.
+ */
+ for (i = 0; i < num_parted; i++)
+ {
+ inhOIDs = lappend_oid(inhOIDs, RelationGetRelid(pds[i]->reldesc));
+ if (pds[i]->reldesc != oldrelation)
+ heap_close(pds[i]->reldesc, NoLock);
+ }
+
+ /* Concatenate the leaf partition OIDs. */
+ inhOIDs = list_concat(inhOIDs, leaf_part_oids);
+ }
+
/* Scan the inheritance set and expand it */
appinfos = NIL;
has_child = false;
--
2.11.0
On Tue, Aug 29, 2017 at 10:36 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
I keep having the feeling that this is a big patch with a small patch
struggling to get out. Is it really necessary to change
RelationGetPartitionDispatchInfo so much or could you just do a really
minimal surgery to remove the code that sets the stuff we don't need?
Like this:
Sure, done in the attached updated patch.
On first glance, that looks pretty good. I'll have a deeper look
tomorrow. It strikes me that if PartitionTupleRoutingInfo is an
executor structure, we should probably (as a separate patch) relocate
FormPartitionKeyDatum and get_partition_for_tuple to someplace in
src/backend/executor and rename them accordingly; maybe it's time for
an execPartition.c? catalog/partition.c is getting really big, so
it's not a bad thing if some of that stuff gets moved elsewhere. A
lingering question in my mind, though, is whether it's really correct
to think of PartitionTupleRoutingInfo as executor-specific. Maybe
we're going to need that for other things too?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2017/08/30 12:03, Robert Haas wrote:
On Tue, Aug 29, 2017 at 10:36 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
I keep having the feeling that this is a big patch with a small patch
struggling to get out. Is it really necessary to change
RelationGetPartitionDispatchInfo so much or could you just do a really
minimal surgery to remove the code that sets the stuff we don't need?
Like this:
Sure, done in the attached updated patch.
On first glance, that looks pretty good. I'll have a deeper look
tomorrow.
Thanks.
It strikes me that if PartitionTupleRoutingInfo is an
executor structure, we should probably (as a separate patch) relocate
FormPartitionKeyDatum and get_partition_for_tuple to someplace in
src/backend/executor and rename them accordingly; maybe it's time for
an execPartition.c? catalog/partition.c is getting really big, so
I agree.
It would be a good idea to introduce an execPartition.c so that future
patches in this area (such as executor partition-pruning patch on the
horizon) have a convenient place to park their code.
Will see if I can make a patch for that.
it's not a bad thing if some of that stuff gets moved elsewhere. A
lingering question in my mind, though, is whether it's really correct
to think of PartitionTupleRoutingInfo as executor-specific. Maybe
we're going to need that for other things too?
Hmm. Maybe a subset of PartitionTupleRoutingInfo's fields is usable
outside the executor (only PartitionDispatch, which is exported by
partition.h anyway?), but not all of them. For example, the fields keystate
and tupslot seem pretty executor-specific. In fact, they seem to be required
only for tuple routing.
One idea is to not introduce PartitionTupleRoutingInfo at all and add its
fields directly as ModifyTableState/CopyState fields. We currently have,
for example, a mt_partition_tupconv_maps array with one slot for every
leaf partition. Maybe there could be the following fields in
ModifyTableState (and similarly in CopyState):
int mt_num_parted; /* this one exists today */
struct PartitionDispatchData **mt_partition_dispatch_info; /* and this */
List **mt_parted_keystate;
TupleConversionMap **mt_parted_tupconv_maps;
TupleTableSlot **mt_parted_tupslots;
Each of those arrays will have mt_num_parted slots.
Thanks,
Amit
On 25 August 2017 at 23:58, Robert Haas <robertmhaas@gmail.com> wrote:
That just leaves indexes. In a world where keystate, tupslot, and
tupmap are removed from the PartitionDispatchData, you must need
indexes or there would be no point in constructing a
PartitionDispatchData object in the first place; any application that
needs neither indexes nor the executor-specific stuff could just use
the Relation directly.
But there is expand_inherited_rtentry(), which requires neither indexes
nor any executor stuff, but still needs to call
RelationGetPartitionDispatchInfo(), and so these indexes get built
unnecessarily.
Looking at the latest patch, I can see that those indexes can be
separately built in ExecSetupPartitionTupleRouting() where it is
required, instead of in RelationGetPartitionDispatchInfo(). In the
loop which iterates through the pd[] returned from
RelationGetPartitionDispatchInfo(), we can build them using the exact
code currently written to build them in
RelationGetPartitionDispatchInfo().
In the future, if other applications turn out to need the indexes as
well, we could have a separate function that only builds them, and
ExecSetupPartitionTupleRouting() would call that function too.
--------
On 21 August 2017 at 11:40, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
In ExecSetupPartitionTupleRouting(), not sure why ptrinfos array is an
array of pointers. Why can't it be an array of
PartitionTupleRoutingInfo structure rather than pointer to that
structure ?
AFAIK, assigning pointers is less expensive than assigning struct and we
end up doing a lot of assigning of the members of that array to a local
variable
I didn't get why exactly we would have to copy the structures. We
could just pass the address of ptrinfos[index], no?
My only point was that we would not have to call palloc0() for each
element of ptrinfos. Instead, we could allocate memory for all of the
elements in a single palloc0(). We have to allocate memory for *each*
element anyway.
in get_partition_for_tuple(), for example. Perhaps, we could
avoid those assignments and implement it the way you suggest.
You mean at these two places in get_partition_for_tuple(), right?
1. /* start with the root partitioned table */
parent = ptrinfos[0];
2. else
parent = ptrinfos[-parent->pd->indexes[cur_index]];
In both of the above places, we could just use &ptrinfos[...] instead of
ptrinfos[...]. But I guess you meant something else.
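To make the two suggestions above concrete, here is a minimal sketch, not code from the patch. RoutingInfo is a hypothetical, trimmed-down stand-in for PartitionTupleRoutingInfo, calloc() stands in for palloc0(), and the choices[] argument replaces the real partition-key lookup. It only illustrates that one zeroed allocation can hold all entries and that the descent can take the address of an array element rather than copy a struct.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down stand-in for PartitionTupleRoutingInfo. */
typedef struct RoutingInfo
{
	/*
	 * indexes[i] >= 0: index of a leaf partition;
	 * indexes[i] <  0: negated position of a partitioned child in the array.
	 */
	const int  *indexes;
	int			nparts;
} RoutingInfo;

/* One zeroed allocation for all entries (calloc stands in for palloc0). */
RoutingInfo *
alloc_routing_infos(int num_parted)
{
	return calloc((size_t) num_parted, sizeof(RoutingInfo));
}

/*
 * Walk down from the root.  choices[level] is the partition picked at each
 * level; in the real code this would come from comparing the partition key
 * against the bounds.  Returns the leaf partition index.
 */
int
route(RoutingInfo *infos, const int *choices)
{
	RoutingInfo *parent = &infos[0];	/* start with the root */
	int			level = 0;

	for (;;)
	{
		int			cur_index = choices[level++];
		int			idx;

		assert(cur_index >= 0 && cur_index < parent->nparts);
		idx = parent->indexes[cur_index];
		if (idx >= 0)
			return idx;			/* reached a leaf partition */
		parent = &infos[-idx];	/* descend by address, no struct copy */
	}
}
```

With an array of structs the caller frees a single block, whereas an array of pointers needs one allocation per element plus one for the array itself.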
------------
RelationGetPartitionDispatchInfo() opens all the partitioned tables.
But ExecSetupPartitionTupleRouting() then opens all the parents again,
that is, all the partitioned tables, and closes them.
Would it be possible to do this instead: replace the
PartitionDispatch->parentoid field with a parentrel Relation field that
points to the reldesc field of one of the pds[] elements?
------------
For me, the CopyStateData->ptrinfos and ModifyTableState.mt_ptrinfos
field names sound confusing. How about part_tuple_routing_info or just
tuple_routing_info?
On Wed, Aug 30, 2017 at 8:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Aug 29, 2017 at 10:36 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
I keep having the feeling that this is a big patch with a small patch
struggling to get out. Is it really necessary to change
RelationGetPartitionDispatchInfo so much or could you just do a really
minimal surgery to remove the code that sets the stuff we don't need?
Like this:
Sure, done in the attached updated patch.
On first glance, that looks pretty good. I'll have a deeper look
tomorrow.
In one of your earlier mails on this thread, you described what
expand_inherited_rtentry() would look like:
-- quote --
1. Calling find_all_inheritors with a new only-lock-the-partitions
flag. This should result in locking all partitioned tables in the
inheritance hierarchy in breadth-first, low-OID-first order. (When
the only-lock-the-partitions isn't specified, all partitioned tables
should still be locked before any unpartitioned tables, so that the
locking order in that case is consistent with what we do here.)
2. Iterate over the partitioned tables identified in step 1 in the
order in which they were returned. For each one:
- Decide which children can be pruned.
- Lock the unpruned, non-partitioned children in low-OID-first order.
3. Make another pass over the inheritance hierarchy, starting at the
root. Traverse the whole hierarchy in breadth-first in *bound* order.
Add RTEs and AppendRelInfos as we go -- these will have rte->inh =
true for partitioned tables and rte->inh = false for leaf partitions.
-- quote ends --
Amit's patches seem to be addressing the third point here. But the
expansion is not happening in breadth-first manner. We are expanding
all the partitioned partitions first and then leaf partitions. So
that's not exactly "breadth-first".
I tried to rebase the first patch from the partition-wise join patchset [1] on
top of these two patches. I am having a hard time applying those
changes. The goal of my patch is to expand the partitioned table
into an inheritance hierarchy which retains the partition hierarchy.
For that to happen, we need to know which partition belongs to which
partitioned table in the partition hierarchy. PartitionDispatch array
provided by RelationGetPartitionDispatchInfo() provides me the parent
OIDs of partitioned parents but it doesn't do so for the leaf
partitions. So, I changed the signature of that function to return the
list of parent OIDs of leaf partitions. But for building
AppendRelInfos, child RTEs and child row marks, I need parent's RTI,
RTE and row marks, which are not available directly. Given parent's
OID, I need to search root->parse->rtable to find its RTE, RTI and
then using RTI I can find rowmarks. But that seems to defeat the
reason partition-wise join needs EIBO, i.e. to avoid the O(n^2) loop
in build_simple_rel(). To eliminate that loop we would be introducing
another O(n^2) loop in expand_inherited_rtentry(). Even without
considering the O(n^2) complexity, this looks ugly.
A better way for translating partition hierarchy into inheritance
hierarchy may be to expand all partitions (partitioned or leaf) of a
given parent at a time in breadth-first manner. This allows us to
create child RTE, AppendRelInfo, rowmarks while we have corresponding
parent structures at hand, rather than searching for those. This would
still satisfy Amit Khandekar's requirement to expand leaf partitions
in the same order as their OIDs would be returned by
RelationGetPartitionDispatchInfo(). I have a feeling that, if we go
that route, we will replace almost all the changes that Amit Langote's
patches do to expand_inherited_rtentry().
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Wed, Aug 30, 2017 at 9:22 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
Amit's patches seem to be addressing the third point here. But the
expansion is not happening in breadth-first manner. We are expanding
all the partitioned partitions first and then leaf partitions. So
that's not exactly "breadth-first".
Correct, but I think Amit's ordering is what we actually want:
breadth-first, low-OID-first over interior partitioned tables, and
then breadth-first, low-OID-first again over leaves. If we don't keep
partitioned partitions first, then we're going to have problems
keeping the locking order consistent when we start doing pruning
during expansion.
A better way for translating partition hierarchy into inheritance
hierarchy may be to expand all partitions (partitioned or leaf) of a
given parent at a time in breadth-first manner. This allows us to
create child RTE, AppendRelInfo, rowmarks while we have corresponding
parent structures at hand, rather than searching for those. This would
still satisfy Amit Khandekar's requirement to expand leaf partitions
in the same order as their OIDs would be returned by
RelationGetPartitionDispatchInfo(). I have a feeling that, if we go
that route, we will replace almost all the changes that Amit Langote's
patches do to expand_inherited_rtentry().
I think we will, too, but I think that's basically the problem of the
partition-wise join patch. Either find_all_inheritors is going to
have to return enough additional information to let
expand_inherited_rtentry work efficiently, or else
expand_inherited_rtentry is going to have to duplicate some of the
logic from find_all_inheritors. But that doesn't make what Amit is
doing here a bad idea -- getting stuff that shouldn't be part of
PartitionDispatch removed and getting the expansion order in
expand_inherited_rtentry() changed seem to be the right things to do
even if the way it's implemented has to be revised to meet other
goals.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Aug 30, 2017 at 10:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Aug 30, 2017 at 9:22 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
Amit's patches seem to be addressing the third point here. But the
expansion is not happening in breadth-first manner. We are expanding
all the partitioned partitions first and then leaf partitions. So
that's not exactly "breadth-first".
Correct, but I think Amit's ordering is what we actually want:
breadth-first, low-OID-first over interior partitioned tables, and
then breadth-first, low-OID-first again over leaves. If we don't keep
partitioned partitions first, then we're going to have problems
keeping the locking order consistent when we start doing pruning
during expansion.
No, I'm wrong and you're correct. We want the partitions to be locked
first, but we don't want them to be pulled to the front of the
expansion order, because then it's not in bound order anymore and any
optimization that tries to rely on that ordering will break.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Aug 30, 2017 at 6:08 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On 25 August 2017 at 23:58, Robert Haas <robertmhaas@gmail.com> wrote:
That just leaves indexes. In a world where keystate, tupslot, and
tupmap are removed from the PartitionDispatchData, you must need
indexes or there would be no point in constructing a
PartitionDispatchData object in the first place; any application that
needs neither indexes nor the executor-specific stuff could just use
the Relation directly.
But there is expand_inherited_rtentry(), which requires neither indexes
nor any executor stuff, but still needs to call
RelationGetPartitionDispatchInfo(), and so these indexes get built
unnecessarily.
True, but the reason why expand_inherited_rtentry() needs to call
RelationGetPartitionDispatchInfo() is to get back the leaf partition
OIDs in bound order. If we're using
RelationGetPartitionDispatchInfo() to get the leaf partition OIDs into
bound order, we've got to run the loop that builds leaf_part_oids, and
the same loop constructs indexes. So I don't think we're doing much
redundant work there.
Now, if we made it the job of expand_inherited_rtentry() to loop over
the PartitionDesc, then it really wouldn't need to call
RelationGetPartitionDispatchInfo at all. Maybe that's actually a
better plan anyway, because as Ashutosh points out, we don't want the
partitioned children to show up before the unpartitioned children in
the resulting ordering.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Aug 30, 2017 at 9:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Aug 30, 2017 at 6:08 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On 25 August 2017 at 23:58, Robert Haas <robertmhaas@gmail.com> wrote:
That just leaves indexes. In a world where keystate, tupslot, and
tupmap are removed from the PartitionDispatchData, you must need
indexes or there would be no point in constructing a
PartitionDispatchData object in the first place; any application that
needs neither indexes nor the executor-specific stuff could just use
the Relation directly.
But there is expand_inherited_rtentry(), which requires neither indexes
nor any executor stuff, but still needs to call
RelationGetPartitionDispatchInfo(), and so these indexes get built
unnecessarily.
True, but the reason why expand_inherited_rtentry() needs to call
RelationGetPartitionDispatchInfo() is to get back the leaf partition
OIDs in bound order. If we're using
RelationGetPartitionDispatchInfo() to get the leaf partition OIDs into
bound order, we've got to run the loop that builds leaf_part_oids, and
the same loop constructs indexes. So I don't think we're doing much
redundant work there.
Now, if we made it the job of expand_inherited_rtentry() to loop over
the PartitionDesc, then it really wouldn't need to call
RelationGetPartitionDispatchInfo at all. Maybe that's actually a
better plan anyway, because as Ashutosh points out, we don't want the
partitioned children to show up before the unpartitioned children in
the resulting ordering.
+1. I think we should just pull out the OIDs from partition descriptor.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Wed, Aug 30, 2017 at 12:47 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
+1. I think we should just pull out the OIDs from partition descriptor.
Like this? The first patch refactors the expansion of a single child
out into a separate function, and the second patch implements EIBO on
top of it.
I realized while doing this that we really want to expand the
partitioning hierarchy depth-first, not breadth-first. For some
things, like partition-wise join in the case where all bounds match
exactly, we really only need a *predictable* ordering that will be the
same for two equi-partitioned tables. A breadth-first expansion will
give us that. But it's not actually in bound order. For example:
create table foo (a int, b text) partition by list (a);
create table foo1 partition of foo for values in (2);
create table foo2 partition of foo for values in (1) partition by range (b);
create table foo2a partition of foo2 for values from ('b') to ('c');
create table foo2b partition of foo2 for values from ('a') to ('b');
create table foo3 partition of foo for values in (3);
The correct bound-order expansion of this is foo2b - foo2a - foo1 -
foo3, which is indeed what you get with the attached patch. But if we
did the expansion in breadth-first fashion, we'd get foo1 - foo3 -
foo2a, foo2b, which is, well, not in bound order. If the idea is that
you see a > 2 and rule out all partitions that appear before the first
one with an a-value >= 2, it's not going to work.
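The difference between the two traversals can be reproduced with a small standalone sketch. This is not the patch's code: Part is a hypothetical miniature of a partition tree whose children are stored in bound order, so foo2 (values in (1)) precedes foo1 (values in (2)). In this toy breadth-first variant each level is itself visited in bound order, so the sub-partitions come out as foo2b, foo2a; either way they land after foo1 and foo3, which is the problem described above.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PARTS	8
#define QUEUE_MAX	64

/* Hypothetical miniature of a partition tree; children are in bound order. */
typedef struct Part
{
	const char *name;
	int			nchildren;		/* 0 for a leaf partition */
	const struct Part *children[MAX_PARTS];
} Part;

/*
 * Depth-first: recurse into a partitioned child at the point where it
 * appears, so its leaves are emitted in place.  Returns the new count.
 */
int
expand_depth_first(const Part *p, const char **out, int n)
{
	int			i;

	if (p->nchildren == 0)
	{
		out[n++] = p->name;
		return n;
	}
	for (i = 0; i < p->nchildren; i++)
		n = expand_depth_first(p->children[i], out, n);
	return n;
}

/*
 * Breadth-first: emit the leaves of one partitioned table, deferring its
 * partitioned children until the current level is finished.
 */
int
expand_breadth_first(const Part *root, const char **out)
{
	const Part *queue[QUEUE_MAX];
	int			head = 0;
	int			tail = 0;
	int			n = 0;
	int			i;

	if (root->nchildren == 0)
	{
		out[0] = root->name;
		return 1;
	}
	queue[tail++] = root;
	while (head < tail)
	{
		const Part *p = queue[head++];

		for (i = 0; i < p->nchildren; i++)
		{
			const Part *c = p->children[i];

			if (c->nchildren == 0)
				out[n++] = c->name;
			else
			{
				assert(tail < QUEUE_MAX);
				queue[tail++] = c;
			}
		}
	}
	return n;
}
```

Built over the foo example, the depth-first function yields foo2b, foo2a, foo1, foo3, while the breadth-first one starts with foo1, foo3.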
Mind you, that idea has some problems anyway in the face of default
partitions, null partitions, and list partitions which accept
non-contiguous values (e.g. one partition for 1, 3, 5; another for 2,
4, 6). We might need to mark the PartitionDesc to indicate whether
PartitionDesc-order is in fact bound-order in a particular instance,
or something like that.
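As a rough illustration of what such a mark would have to capture for list partitions, the sketch below checks whether the value sets of consecutive partitions interleave. Everything here is hypothetical: ListPart and list_parts_in_bound_order() are made-up names, values are plain ints, and default and null partitions are ignored entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical list partition: its accepted values, sorted ascending. */
typedef struct ListPart
{
	const int  *values;
	int			nvalues;
} ListPart;

/*
 * parts[] must be ordered by each partition's smallest value.  The order
 * is a usable bound order only if every value of one partition is smaller
 * than every value of the next, i.e. the value sets do not interleave.
 */
bool
list_parts_in_bound_order(const ListPart *parts, int nparts)
{
	int			i;

	for (i = 0; i + 1 < nparts; i++)
	{
		assert(parts[i].nvalues > 0 && parts[i + 1].nvalues > 0);
		if (parts[i].values[parts[i].nvalues - 1] >= parts[i + 1].values[0])
			return false;
	}
	return true;
}
```

For the example above, partitions accepting (1, 3, 5) and (2, 4, 6) fail the check, whereas (1, 2) and (3, 4) pass.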
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
0001-expand_single_inheritance_child.patch (application/octet-stream)
From 6adb696b45bd2bc6814adf52851508706b27f0ad Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 30 Aug 2017 12:53:43 -0400
Subject: [PATCH 1/2] expand_single_inheritance_child
---
src/backend/optimizer/prep/prepunion.c | 225 +++++++++++++++++++--------------
1 file changed, 127 insertions(+), 98 deletions(-)
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index e73c819901..870a4a6bfd 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -100,6 +100,12 @@ static List *generate_append_tlist(List *colTypes, List *colCollations,
static List *generate_setop_grouplist(SetOperationStmt *op, List *targetlist);
static void expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte,
Index rti);
+static void expand_single_inheritance_child(PlannerInfo *root,
+ RangeTblEntry *rte,
+ Index rti, Relation oldrelation,
+ PlanRowMark *oldrc, Relation newrelation,
+ bool *has_child, List **appinfos,
+ List **partitioned_child_rels);
static void make_inh_translation_list(Relation oldrelation,
Relation newrelation,
Index newvarno,
@@ -1459,9 +1465,6 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
{
Oid childOID = lfirst_oid(l);
Relation newrelation;
- RangeTblEntry *childrte;
- Index childRTindex;
- AppendRelInfo *appinfo;
/* Open rel if needed; we already have required locks */
if (childOID != parentOID)
@@ -1481,101 +1484,10 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
continue;
}
- /*
- * Build an RTE for the child, and attach to query's rangetable list.
- * We copy most fields of the parent's RTE, but replace relation OID
- * and relkind, and set inh = false. Also, set requiredPerms to zero
- * since all required permissions checks are done on the original RTE.
- * Likewise, set the child's securityQuals to empty, because we only
- * want to apply the parent's RLS conditions regardless of what RLS
- * properties individual children may have. (This is an intentional
- * choice to make inherited RLS work like regular permissions checks.)
- * The parent securityQuals will be propagated to children along with
- * other base restriction clauses, so we don't need to do it here.
- */
- childrte = copyObject(rte);
- childrte->relid = childOID;
- childrte->relkind = newrelation->rd_rel->relkind;
- childrte->inh = false;
- childrte->requiredPerms = 0;
- childrte->securityQuals = NIL;
- parse->rtable = lappend(parse->rtable, childrte);
- childRTindex = list_length(parse->rtable);
-
- /*
- * Build an AppendRelInfo for this parent and child, unless the child
- * is a partitioned table.
- */
- if (childrte->relkind != RELKIND_PARTITIONED_TABLE)
- {
- /* Remember if we saw a real child. */
- if (childOID != parentOID)
- has_child = true;
-
- appinfo = makeNode(AppendRelInfo);
- appinfo->parent_relid = rti;
- appinfo->child_relid = childRTindex;
- appinfo->parent_reltype = oldrelation->rd_rel->reltype;
- appinfo->child_reltype = newrelation->rd_rel->reltype;
- make_inh_translation_list(oldrelation, newrelation, childRTindex,
- &appinfo->translated_vars);
- appinfo->parent_reloid = parentOID;
- appinfos = lappend(appinfos, appinfo);
-
- /*
- * Translate the column permissions bitmaps to the child's attnums
- * (we have to build the translated_vars list before we can do
- * this). But if this is the parent table, leave copyObject's
- * result alone.
- *
- * Note: we need to do this even though the executor won't run any
- * permissions checks on the child RTE. The
- * insertedCols/updatedCols bitmaps may be examined for
- * trigger-firing purposes.
- */
- if (childOID != parentOID)
- {
- childrte->selectedCols = translate_col_privs(rte->selectedCols,
- appinfo->translated_vars);
- childrte->insertedCols = translate_col_privs(rte->insertedCols,
- appinfo->translated_vars);
- childrte->updatedCols = translate_col_privs(rte->updatedCols,
- appinfo->translated_vars);
- }
- }
- else
- partitioned_child_rels = lappend_int(partitioned_child_rels,
- childRTindex);
-
- /*
- * Build a PlanRowMark if parent is marked FOR UPDATE/SHARE.
- */
- if (oldrc)
- {
- PlanRowMark *newrc = makeNode(PlanRowMark);
-
- newrc->rti = childRTindex;
- newrc->prti = rti;
- newrc->rowmarkId = oldrc->rowmarkId;
- /* Reselect rowmark type, because relkind might not match parent */
- newrc->markType = select_rowmark_type(childrte, oldrc->strength);
- newrc->allMarkTypes = (1 << newrc->markType);
- newrc->strength = oldrc->strength;
- newrc->waitPolicy = oldrc->waitPolicy;
-
- /*
- * We mark RowMarks for partitioned child tables as parent
- * RowMarks so that the executor ignores them (except their
- * existence means that the child tables be locked using
- * appropriate mode).
- */
- newrc->isParent = (childrte->relkind == RELKIND_PARTITIONED_TABLE);
-
- /* Include child's rowmark type in parent's allMarkTypes */
- oldrc->allMarkTypes |= newrc->allMarkTypes;
-
- root->rowMarks = lappend(root->rowMarks, newrc);
- }
+ expand_single_inheritance_child(root, rte, rti, oldrelation, oldrc,
+ newrelation,
+ &has_child, &appinfos,
+ &partitioned_child_rels);
/* Close child relations, but keep locks */
if (childOID != parentOID)
@@ -1621,6 +1533,123 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
}
/*
+ * expand_single_inheritance_child
+ * Expand a single inheritance child, if needed.
+ *
+ * If this is a temp table of another backend, we'll return without doing
+ * anything at all. Otherwise, we'll build a RangeTblEntry and either a
+ * PartitionedChildRelInfo or AppendRelInfo as appropriate, plus maybe a
+ * PlanRowMark.
+ */
+static void
+expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *rte,
+ Index rti, Relation oldrelation,
+ PlanRowMark *oldrc, Relation newrelation,
+ bool *has_child, List **appinfos,
+ List **partitioned_child_rels)
+{
+ Query *parse = root->parse;
+ Oid parentOID = RelationGetRelid(oldrelation);
+ Oid childOID = RelationGetRelid(newrelation);
+ RangeTblEntry *childrte;
+ Index childRTindex;
+ AppendRelInfo *appinfo;
+
+ /*
+ * Build an RTE for the child, and attach to query's rangetable list. We
+ * copy most fields of the parent's RTE, but replace relation OID and
+ * relkind, and set inh = false. Also, set requiredPerms to zero since
+ * all required permissions checks are done on the original RTE. Likewise,
+ * set the child's securityQuals to empty, because we only want to apply
+ * the parent's RLS conditions regardless of what RLS properties
+ * individual children may have. (This is an intentional choice to make
+ * inherited RLS work like regular permissions checks.) The parent
+ * securityQuals will be propagated to children along with other base
+ * restriction clauses, so we don't need to do it here.
+ */
+ childrte = copyObject(rte);
+ childrte->relid = childOID;
+ childrte->relkind = newrelation->rd_rel->relkind;
+ childrte->inh = false;
+ childrte->requiredPerms = 0;
+ childrte->securityQuals = NIL;
+ parse->rtable = lappend(parse->rtable, childrte);
+ childRTindex = list_length(parse->rtable);
+
+ /*
+ * Build an AppendRelInfo for this parent and child, unless the child is a
+ * partitioned table.
+ */
+ if (childrte->relkind != RELKIND_PARTITIONED_TABLE)
+ {
+ /* Remember if we saw a real child. */
+ if (childOID != parentOID)
+ *has_child = true;
+
+ appinfo = makeNode(AppendRelInfo);
+ appinfo->parent_relid = rti;
+ appinfo->child_relid = childRTindex;
+ appinfo->parent_reltype = oldrelation->rd_rel->reltype;
+ appinfo->child_reltype = newrelation->rd_rel->reltype;
+ make_inh_translation_list(oldrelation, newrelation, childRTindex,
+ &appinfo->translated_vars);
+ appinfo->parent_reloid = parentOID;
+ *appinfos = lappend(*appinfos, appinfo);
+
+ /*
+ * Translate the column permissions bitmaps to the child's attnums (we
+ * have to build the translated_vars list before we can do this). But
+ * if this is the parent table, leave copyObject's result alone.
+ *
+ * Note: we need to do this even though the executor won't run any
+ * permissions checks on the child RTE. The insertedCols/updatedCols
+ * bitmaps may be examined for trigger-firing purposes.
+ */
+ if (childOID != parentOID)
+ {
+ childrte->selectedCols = translate_col_privs(rte->selectedCols,
+ appinfo->translated_vars);
+ childrte->insertedCols = translate_col_privs(rte->insertedCols,
+ appinfo->translated_vars);
+ childrte->updatedCols = translate_col_privs(rte->updatedCols,
+ appinfo->translated_vars);
+ }
+ }
+ else
+ *partitioned_child_rels = lappend_int(*partitioned_child_rels,
+ childRTindex);
+
+ /*
+ * Build a PlanRowMark if parent is marked FOR UPDATE/SHARE.
+ */
+ if (oldrc)
+ {
+ PlanRowMark *newrc = makeNode(PlanRowMark);
+
+ newrc->rti = childRTindex;
+ newrc->prti = rti;
+ newrc->rowmarkId = oldrc->rowmarkId;
+ /* Reselect rowmark type, because relkind might not match parent */
+ newrc->markType = select_rowmark_type(childrte, oldrc->strength);
+ newrc->allMarkTypes = (1 << newrc->markType);
+ newrc->strength = oldrc->strength;
+ newrc->waitPolicy = oldrc->waitPolicy;
+
+ /*
+ * We mark RowMarks for partitioned child tables as parent RowMarks so
+ * that the executor ignores them (except their existence means that
+ * the child tables be locked using appropriate mode).
+ */
+ newrc->isParent = (childrte->relkind == RELKIND_PARTITIONED_TABLE);
+
+ /* Include child's rowmark type in parent's allMarkTypes */
+ oldrc->allMarkTypes |= newrc->allMarkTypes;
+
+ root->rowMarks = lappend(root->rowMarks, newrc);
+ }
+}
+
+/*
* make_inh_translation_list
* Build the list of translations from parent Vars to child Vars for
* an inheritance child.
--
2.11.0 (Apple Git-81)
0002-EIBO.patch (application/octet-stream)
From a0b1b7c2eadaebd236c48d9017effdf7c20fafc4 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 30 Aug 2017 15:22:08 -0400
Subject: [PATCH 2/2] EIBO
---
src/backend/optimizer/prep/prepunion.c | 126 ++++++++++++++++++++++++++-------
src/test/regress/expected/insert.out | 4 +-
2 files changed, 104 insertions(+), 26 deletions(-)
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 870a4a6bfd..3f5138f54e 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -33,6 +33,7 @@
#include "access/heapam.h"
#include "access/htup_details.h"
#include "access/sysattr.h"
+#include "catalog/partition.h"
#include "catalog/pg_inherits_fn.h"
#include "catalog/pg_type.h"
#include "miscadmin.h"
@@ -100,6 +101,13 @@ static List *generate_append_tlist(List *colTypes, List *colCollations,
static List *generate_setop_grouplist(SetOperationStmt *op, List *targetlist);
static void expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte,
Index rti);
+static void expand_partitions_recursively(PlannerInfo *root,
+ RangeTblEntry *rte,
+ Index rti, Relation oldrelation,
+ PlanRowMark *oldrc, PartitionDesc partdesc,
+ LOCKMODE lockmode,
+ bool *has_child, List **appinfos,
+ List **partitioned_child_rels);
static void expand_single_inheritance_child(PlannerInfo *root,
RangeTblEntry *rte,
Index rti, Relation oldrelation,
@@ -1461,37 +1469,62 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
/* Scan the inheritance set and expand it */
appinfos = NIL;
has_child = false;
- foreach(l, inhOIDs)
+ if (RelationGetPartitionDesc(oldrelation) != NULL)
{
- Oid childOID = lfirst_oid(l);
- Relation newrelation;
-
- /* Open rel if needed; we already have required locks */
- if (childOID != parentOID)
- newrelation = heap_open(childOID, NoLock);
- else
- newrelation = oldrelation;
-
/*
- * It is possible that the parent table has children that are temp
- * tables of other backends. We cannot safely access such tables
- * (because of buffering issues), and the best thing to do seems to be
- * to silently ignore them.
+ * If this table has partitions, recursively expand them in the order
+ * in which they appear in the PartitionDesc. But first, expand the
+ * parent itself.
*/
- if (childOID != parentOID && RELATION_IS_OTHER_TEMP(newrelation))
- {
- heap_close(newrelation, lockmode);
- continue;
- }
-
expand_single_inheritance_child(root, rte, rti, oldrelation, oldrc,
- newrelation,
+ oldrelation,
&has_child, &appinfos,
&partitioned_child_rels);
+ expand_partitions_recursively(root, rte, rti, oldrelation, oldrc,
+ RelationGetPartitionDesc(oldrelation),
+ lockmode,
+ &has_child, &appinfos,
+ &partitioned_child_rels);
+ }
+ else
+ {
+ /*
+ * This table has no partitions. Expand any plain inheritance
+ * children in the order the OIDs were returned by
+ * find_all_inheritors.
+ */
+ foreach(l, inhOIDs)
+ {
+ Oid childOID = lfirst_oid(l);
+ Relation newrelation;
- /* Close child relations, but keep locks */
- if (childOID != parentOID)
- heap_close(newrelation, NoLock);
+ /* Open rel if needed; we already have required locks */
+ if (childOID != parentOID)
+ newrelation = heap_open(childOID, NoLock);
+ else
+ newrelation = oldrelation;
+
+ /*
+ * It is possible that the parent table has children that are temp
+ * tables of other backends. We cannot safely access such tables
+ * (because of buffering issues), and the best thing to do seems
+ * to be to silently ignore them.
+ */
+ if (childOID != parentOID && RELATION_IS_OTHER_TEMP(newrelation))
+ {
+ heap_close(newrelation, lockmode);
+ continue;
+ }
+
+ expand_single_inheritance_child(root, rte, rti, oldrelation, oldrc,
+ newrelation,
+ &has_child, &appinfos,
+ &partitioned_child_rels);
+
+ /* Close child relations, but keep locks */
+ if (childOID != parentOID)
+ heap_close(newrelation, NoLock);
+ }
}
heap_close(oldrelation, NoLock);
@@ -1532,6 +1565,51 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
root->append_rel_list = list_concat(root->append_rel_list, appinfos);
}
+static void
+expand_partitions_recursively(PlannerInfo *root, RangeTblEntry *rte,
+ Index rti, Relation oldrelation,
+ PlanRowMark *oldrc, PartitionDesc partdesc,
+ LOCKMODE lockmode,
+ bool *has_child, List **appinfos,
+ List **partitioned_child_rels)
+{
+ int i;
+
+ check_stack_depth();
+
+ for (i = 0; i < partdesc->nparts; i++)
+ {
+ Oid childOID = partdesc->oids[i];
+ Relation newrelation;
+
+ /* Open rel; we already have required locks */
+ newrelation = heap_open(childOID, NoLock);
+
+ /* As in expand_inherited_rtentry, skip non-local temp tables */
+ if (RELATION_IS_OTHER_TEMP(newrelation))
+ {
+ heap_close(newrelation, lockmode);
+ continue;
+ }
+
+ expand_single_inheritance_child(root, rte, rti, oldrelation, oldrc,
+ newrelation,
+ has_child, appinfos,
+ partitioned_child_rels);
+
+ /* If this child is itself partitioned, recurse */
+ if (newrelation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+ expand_partitions_recursively(root, rte, rti, oldrelation, oldrc,
+ RelationGetPartitionDesc(newrelation),
+ lockmode,
+ has_child, appinfos,
+ partitioned_child_rels);
+
+ /* Close child relation, but keep locks */
+ heap_close(newrelation, NoLock);
+ }
+}
+
/*
* expand_single_inheritance_child
* Expand a single inheritance child, if needed.
diff --git a/src/test/regress/expected/insert.out b/src/test/regress/expected/insert.out
index a2d9469592..e159d62b66 100644
--- a/src/test/regress/expected/insert.out
+++ b/src/test/regress/expected/insert.out
@@ -278,12 +278,12 @@ select tableoid::regclass, * from list_parted;
-------------+----+----
part_aa_bb | aA |
part_cc_dd | cC | 1
- part_null | | 0
- part_null | | 1
part_ee_ff1 | ff | 1
part_ee_ff1 | EE | 1
part_ee_ff2 | ff | 11
part_ee_ff2 | EE | 10
+ part_null | | 0
+ part_null | | 1
(8 rows)
-- some more tests to exercise tuple-routing with multi-level partitioning
--
2.11.0 (Apple Git-81)
On Thu, Aug 31, 2017 at 1:15 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Aug 30, 2017 at 12:47 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
+1. I think we should just pull out the OIDs from partition descriptor.
Like this? The first patch refactors the expansion of a single child
out into a separate function, and the second patch implements EIBO on
top of it.
I realized while doing this that we really want to expand the
partitioning hierarchy depth-first, not breadth-first. For some
things, like partition-wise join in the case where all bounds match
exactly, we really only need a *predictable* ordering that will be the
same for two equi-partitioned tables.
+1. Spotted right!
A breadth-first expansion will
give us that. But it's not actually in bound order. For example:
create table foo (a int, b text) partition by list (a);
create table foo1 partition of foo for values in (2);
create table foo2 partition of foo for values in (1) partition by range (b);
create table foo2a partition of foo2 for values from ('b') to ('c');
create table foo2b partition of foo2 for values from ('a') to ('b');
create table foo3 partition of foo for values in (3);
The correct bound-order expansion of this is foo2b - foo2a - foo1 -
foo3, which is indeed what you get with the attached patch. But if we
did the expansion in breadth-first fashion, we'd get foo1 - foo3 -
foo2a, foo2b, which is, well, not in bound order. If the idea is that
you see a > 2 and rule out all partitions that appear before the first
one with an a-value >= 2, it's not going to work.
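To make the difference between the two traversals concrete, here is a minimal standalone sketch of the foo hierarchy above. It is not planner code: PartNode, expand_depth_first() and expand_breadth_first() are hypothetical names, and each node simply lists its children in bound order, the way a PartitionDesc would.

```c
#include <stddef.h>
#include <string.h>

#define MAX_PARTS 8

/* Hypothetical stand-in for a relation plus its PartitionDesc. */
typedef struct PartNode
{
    const char *name;
    int         nparts;     /* 0 for a leaf partition */
    const struct PartNode *parts[MAX_PARTS];    /* children in bound order */
} PartNode;

/* The hierarchy from the example: foo2 (a = 1) sorts before foo1 (a = 2). */
static const PartNode foo1 = {"foo1", 0, {NULL}};
static const PartNode foo2a = {"foo2a", 0, {NULL}};
static const PartNode foo2b = {"foo2b", 0, {NULL}};
static const PartNode foo3 = {"foo3", 0, {NULL}};
static const PartNode foo2 = {"foo2", 2, {&foo2b, &foo2a}};
static const PartNode foo = {"foo", 3, {&foo2, &foo1, &foo3}};

/* Depth-first: descend into a partitioned child before moving on. */
static size_t
expand_depth_first(const PartNode *rel, const char **out, size_t n)
{
    int         i;

    for (i = 0; i < rel->nparts; i++)
    {
        const PartNode *child = rel->parts[i];

        if (child->nparts == 0)
            out[n++] = child->name;
        else
            n = expand_depth_first(child, out, n);
    }
    return n;
}

/* Breadth-first: finish a whole level before descending. */
static size_t
expand_breadth_first(const PartNode *root, const char **out)
{
    const PartNode *queue[64];
    size_t      head = 0,
                tail = 0,
                n = 0;
    int         i;

    queue[tail++] = root;
    while (head < tail)
    {
        const PartNode *rel = queue[head++];

        for (i = 0; i < rel->nparts; i++)
        {
            const PartNode *child = rel->parts[i];

            if (child->nparts == 0)
                out[n++] = child->name;
            else
                queue[tail++] = child;  /* visit its children later */
        }
    }
    return n;
}
```

With an output array of at least four entries, expand_depth_first(&foo, out, 0) yields foo2b, foo2a, foo1, foo3, while expand_breadth_first(&foo, out) yields foo1, foo3, foo2b, foo2a. The order of the two subpartitions in the breadth-first case depends on how each level is sorted, but either way they land after foo1 and foo3.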
Here are the patches revised a bit. I have esp changed the variable
names and arguments to reflect their true role in the functions. Also
updated prologue of expand_single_inheritance_child() to mention
"has_child". Let me know if those changes look good.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
0001-expand_single_inheritance_child-by-Robert.patch (text/x-patch; charset=US-ASCII)
From ed494bff369e43ff92128b9bd9c553eb19dffdc6 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 30 Aug 2017 12:53:43 -0400
Subject: [PATCH 1/2] expand_single_inheritance_child by Robert
with my changes to rename the arguments to, and variables in,
expand_single_inheritance_child() in accordance with their usage in the
function.
---
src/backend/optimizer/prep/prepunion.c | 225 ++++++++++++++++++--------------
1 file changed, 127 insertions(+), 98 deletions(-)
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index e73c819..bb8f1ce 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -100,6 +100,12 @@ static List *generate_append_tlist(List *colTypes, List *colCollations,
static List *generate_setop_grouplist(SetOperationStmt *op, List *targetlist);
static void expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte,
Index rti);
+static void expand_single_inheritance_child(PlannerInfo *root,
+ RangeTblEntry *parentrte,
+ Index parentRTindex, Relation parentrel,
+ PlanRowMark *parentrc, Relation childrel,
+ bool *has_child, List **appinfos,
+ List **partitioned_child_rels);
static void make_inh_translation_list(Relation oldrelation,
Relation newrelation,
Index newvarno,
@@ -1459,9 +1465,6 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
{
Oid childOID = lfirst_oid(l);
Relation newrelation;
- RangeTblEntry *childrte;
- Index childRTindex;
- AppendRelInfo *appinfo;
/* Open rel if needed; we already have required locks */
if (childOID != parentOID)
@@ -1481,101 +1484,10 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
continue;
}
- /*
- * Build an RTE for the child, and attach to query's rangetable list.
- * We copy most fields of the parent's RTE, but replace relation OID
- * and relkind, and set inh = false. Also, set requiredPerms to zero
- * since all required permissions checks are done on the original RTE.
- * Likewise, set the child's securityQuals to empty, because we only
- * want to apply the parent's RLS conditions regardless of what RLS
- * properties individual children may have. (This is an intentional
- * choice to make inherited RLS work like regular permissions checks.)
- * The parent securityQuals will be propagated to children along with
- * other base restriction clauses, so we don't need to do it here.
- */
- childrte = copyObject(rte);
- childrte->relid = childOID;
- childrte->relkind = newrelation->rd_rel->relkind;
- childrte->inh = false;
- childrte->requiredPerms = 0;
- childrte->securityQuals = NIL;
- parse->rtable = lappend(parse->rtable, childrte);
- childRTindex = list_length(parse->rtable);
-
- /*
- * Build an AppendRelInfo for this parent and child, unless the child
- * is a partitioned table.
- */
- if (childrte->relkind != RELKIND_PARTITIONED_TABLE)
- {
- /* Remember if we saw a real child. */
- if (childOID != parentOID)
- has_child = true;
-
- appinfo = makeNode(AppendRelInfo);
- appinfo->parent_relid = rti;
- appinfo->child_relid = childRTindex;
- appinfo->parent_reltype = oldrelation->rd_rel->reltype;
- appinfo->child_reltype = newrelation->rd_rel->reltype;
- make_inh_translation_list(oldrelation, newrelation, childRTindex,
- &appinfo->translated_vars);
- appinfo->parent_reloid = parentOID;
- appinfos = lappend(appinfos, appinfo);
-
- /*
- * Translate the column permissions bitmaps to the child's attnums
- * (we have to build the translated_vars list before we can do
- * this). But if this is the parent table, leave copyObject's
- * result alone.
- *
- * Note: we need to do this even though the executor won't run any
- * permissions checks on the child RTE. The
- * insertedCols/updatedCols bitmaps may be examined for
- * trigger-firing purposes.
- */
- if (childOID != parentOID)
- {
- childrte->selectedCols = translate_col_privs(rte->selectedCols,
- appinfo->translated_vars);
- childrte->insertedCols = translate_col_privs(rte->insertedCols,
- appinfo->translated_vars);
- childrte->updatedCols = translate_col_privs(rte->updatedCols,
- appinfo->translated_vars);
- }
- }
- else
- partitioned_child_rels = lappend_int(partitioned_child_rels,
- childRTindex);
-
- /*
- * Build a PlanRowMark if parent is marked FOR UPDATE/SHARE.
- */
- if (oldrc)
- {
- PlanRowMark *newrc = makeNode(PlanRowMark);
-
- newrc->rti = childRTindex;
- newrc->prti = rti;
- newrc->rowmarkId = oldrc->rowmarkId;
- /* Reselect rowmark type, because relkind might not match parent */
- newrc->markType = select_rowmark_type(childrte, oldrc->strength);
- newrc->allMarkTypes = (1 << newrc->markType);
- newrc->strength = oldrc->strength;
- newrc->waitPolicy = oldrc->waitPolicy;
-
- /*
- * We mark RowMarks for partitioned child tables as parent
- * RowMarks so that the executor ignores them (except their
- * existence means that the child tables be locked using
- * appropriate mode).
- */
- newrc->isParent = (childrte->relkind == RELKIND_PARTITIONED_TABLE);
-
- /* Include child's rowmark type in parent's allMarkTypes */
- oldrc->allMarkTypes |= newrc->allMarkTypes;
-
- root->rowMarks = lappend(root->rowMarks, newrc);
- }
+ expand_single_inheritance_child(root, rte, rti, oldrelation, oldrc,
+ newrelation,
+ &has_child, &appinfos,
+ &partitioned_child_rels);
/* Close child relations, but keep locks */
if (childOID != parentOID)
@@ -1621,6 +1533,123 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
}
/*
+ * expand_single_inheritance_child
+ * Expand a single inheritance child, if needed.
+ *
+ * If this is a temp table of another backend, we'll return without doing
+ * anything at all. Otherwise, we'll set "has_child" to true, build a
+ * RangeTblEntry and either a PartitionedChildRelInfo or AppendRelInfo as
+ * appropriate, plus maybe a PlanRowMark.
+ */
+static void
+expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
+ Index parentRTindex, Relation parentrel,
+ PlanRowMark *parentrc, Relation childrel,
+ bool *has_child, List **appinfos,
+ List **partitioned_child_rels)
+{
+ Query *parse = root->parse;
+ Oid parentOID = RelationGetRelid(parentrel);
+ Oid childOID = RelationGetRelid(childrel);
+ RangeTblEntry *childrte;
+ Index childRTindex;
+ AppendRelInfo *appinfo;
+
+ /*
+ * Build an RTE for the child, and attach to query's rangetable list. We
+ * copy most fields of the parent's RTE, but replace relation OID and
+ * relkind, and set inh = false. Also, set requiredPerms to zero since
+ * all required permissions checks are done on the original RTE. Likewise,
+ * set the child's securityQuals to empty, because we only want to apply
+ * the parent's RLS conditions regardless of what RLS properties
+ * individual children may have. (This is an intentional choice to make
+ * inherited RLS work like regular permissions checks.) The parent
+ * securityQuals will be propagated to children along with other base
+ * restriction clauses, so we don't need to do it here.
+ */
+ childrte = copyObject(parentrte);
+ childrte->relid = childOID;
+ childrte->relkind = childrel->rd_rel->relkind;
+ childrte->inh = false;
+ childrte->requiredPerms = 0;
+ childrte->securityQuals = NIL;
+ parse->rtable = lappend(parse->rtable, childrte);
+ childRTindex = list_length(parse->rtable);
+
+ /*
+ * Build an AppendRelInfo for this parent and child, unless the child is a
+ * partitioned table.
+ */
+ if (childrte->relkind != RELKIND_PARTITIONED_TABLE)
+ {
+ /* Remember if we saw a real child. */
+ if (childOID != parentOID)
+ *has_child = true;
+
+ appinfo = makeNode(AppendRelInfo);
+ appinfo->parent_relid = parentRTindex;
+ appinfo->child_relid = childRTindex;
+ appinfo->parent_reltype = parentrel->rd_rel->reltype;
+ appinfo->child_reltype = childrel->rd_rel->reltype;
+ make_inh_translation_list(parentrel, childrel, childRTindex,
+ &appinfo->translated_vars);
+ appinfo->parent_reloid = parentOID;
+ *appinfos = lappend(*appinfos, appinfo);
+
+ /*
+ * Translate the column permissions bitmaps to the child's attnums (we
+ * have to build the translated_vars list before we can do this). But
+ * if this is the parent table, leave copyObject's result alone.
+ *
+ * Note: we need to do this even though the executor won't run any
+ * permissions checks on the child RTE. The insertedCols/updatedCols
+ * bitmaps may be examined for trigger-firing purposes.
+ */
+ if (childOID != parentOID)
+ {
+ childrte->selectedCols = translate_col_privs(parentrte->selectedCols,
+ appinfo->translated_vars);
+ childrte->insertedCols = translate_col_privs(parentrte->insertedCols,
+ appinfo->translated_vars);
+ childrte->updatedCols = translate_col_privs(parentrte->updatedCols,
+ appinfo->translated_vars);
+ }
+ }
+ else
+ *partitioned_child_rels = lappend_int(*partitioned_child_rels,
+ childRTindex);
+
+ /*
+ * Build a PlanRowMark if parent is marked FOR UPDATE/SHARE.
+ */
+ if (parentrc)
+ {
+ PlanRowMark *childrc = makeNode(PlanRowMark);
+
+ childrc->rti = childRTindex;
+ childrc->prti = parentRTindex;
+ childrc->rowmarkId = parentrc->rowmarkId;
+ /* Reselect rowmark type, because relkind might not match parent */
+ childrc->markType = select_rowmark_type(childrte, parentrc->strength);
+ childrc->allMarkTypes = (1 << childrc->markType);
+ childrc->strength = parentrc->strength;
+ childrc->waitPolicy = parentrc->waitPolicy;
+
+ /*
+ * We mark RowMarks for partitioned child tables as parent RowMarks so
+ * that the executor ignores them (except their existence means that
+ * the child tables be locked using appropriate mode).
+ */
+ childrc->isParent = (childrte->relkind == RELKIND_PARTITIONED_TABLE);
+
+ /* Include child's rowmark type in parent's allMarkTypes */
+ parentrc->allMarkTypes |= childrc->allMarkTypes;
+
+ root->rowMarks = lappend(root->rowMarks, childrc);
+ }
+}
+
+/*
* make_inh_translation_list
* Build the list of translations from parent Vars to child Vars for
* an inheritance child.
--
1.7.9.5
0002-EIBO-patch-from-Robert.patch (text/x-patch; charset=US-ASCII)
From faac5b1261497c5cceb246bc7d52cdaff2ac5057 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>
Date: Thu, 31 Aug 2017 11:40:11 +0530
Subject: [PATCH 2/2] EIBO patch from Robert
with my changes to rename the arguments to, and variables in,
expand_partitions_recursively(). Also rename expand_partitions_recursively() to
expand_partitioned_rtentry(), in line with expand_inherited_rtentry().
---
src/backend/optimizer/prep/prepunion.c | 127 ++++++++++++++++++++++++++------
src/test/regress/expected/insert.out | 4 +-
2 files changed, 105 insertions(+), 26 deletions(-)
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index bb8f1ce..ccf2145 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -33,6 +33,7 @@
#include "access/heapam.h"
#include "access/htup_details.h"
#include "access/sysattr.h"
+#include "catalog/partition.h"
#include "catalog/pg_inherits_fn.h"
#include "catalog/pg_type.h"
#include "miscadmin.h"
@@ -100,6 +101,13 @@ static List *generate_append_tlist(List *colTypes, List *colCollations,
static List *generate_setop_grouplist(SetOperationStmt *op, List *targetlist);
static void expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte,
Index rti);
+static void expand_partitioned_rtentry(PlannerInfo *root,
+ RangeTblEntry *parentrte,
+ Index parentRTindex, Relation parentrel,
+ PlanRowMark *parentrc, PartitionDesc partdesc,
+ LOCKMODE lockmode,
+ bool *has_child, List **appinfos,
+ List **partitioned_child_rels);
static void expand_single_inheritance_child(PlannerInfo *root,
RangeTblEntry *parentrte,
Index parentRTindex, Relation parentrel,
@@ -1461,37 +1469,62 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
/* Scan the inheritance set and expand it */
appinfos = NIL;
has_child = false;
- foreach(l, inhOIDs)
+ if (RelationGetPartitionDesc(oldrelation) != NULL)
{
- Oid childOID = lfirst_oid(l);
- Relation newrelation;
-
- /* Open rel if needed; we already have required locks */
- if (childOID != parentOID)
- newrelation = heap_open(childOID, NoLock);
- else
- newrelation = oldrelation;
-
/*
- * It is possible that the parent table has children that are temp
- * tables of other backends. We cannot safely access such tables
- * (because of buffering issues), and the best thing to do seems to be
- * to silently ignore them.
+ * If this table has partitions, recursively expand them in the order
+ * in which they appear in the PartitionDesc. But first, expand the
+ * parent itself.
*/
- if (childOID != parentOID && RELATION_IS_OTHER_TEMP(newrelation))
- {
- heap_close(newrelation, lockmode);
- continue;
- }
-
expand_single_inheritance_child(root, rte, rti, oldrelation, oldrc,
- newrelation,
+ oldrelation,
&has_child, &appinfos,
&partitioned_child_rels);
+ expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
+ RelationGetPartitionDesc(oldrelation),
+ lockmode,
+ &has_child, &appinfos,
+ &partitioned_child_rels);
+ }
+ else
+ {
+ /*
+ * This table has no partitions. Expand any plain inheritance
+ * children in the order the OIDs were returned by
+ * find_all_inheritors.
+ */
+ foreach(l, inhOIDs)
+ {
+ Oid childOID = lfirst_oid(l);
+ Relation newrelation;
- /* Close child relations, but keep locks */
- if (childOID != parentOID)
- heap_close(newrelation, NoLock);
+ /* Open rel if needed; we already have required locks */
+ if (childOID != parentOID)
+ newrelation = heap_open(childOID, NoLock);
+ else
+ newrelation = oldrelation;
+
+ /*
+ * It is possible that the parent table has children that are temp
+ * tables of other backends. We cannot safely access such tables
+ * (because of buffering issues), and the best thing to do seems
+ * to be to silently ignore them.
+ */
+ if (childOID != parentOID && RELATION_IS_OTHER_TEMP(newrelation))
+ {
+ heap_close(newrelation, lockmode);
+ continue;
+ }
+
+ expand_single_inheritance_child(root, rte, rti, oldrelation, oldrc,
+ newrelation,
+ &has_child, &appinfos,
+ &partitioned_child_rels);
+
+ /* Close child relations, but keep locks */
+ if (childOID != parentOID)
+ heap_close(newrelation, NoLock);
+ }
}
heap_close(oldrelation, NoLock);
@@ -1532,6 +1565,52 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
root->append_rel_list = list_concat(root->append_rel_list, appinfos);
}
+static void
+expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
+ Index parentRTindex, Relation parentrel,
+ PlanRowMark *parentrc, PartitionDesc partdesc,
+ LOCKMODE lockmode,
+ bool *has_child, List **appinfos,
+ List **partitioned_child_rels)
+{
+ int i;
+
+ check_stack_depth();
+
+ for (i = 0; i < partdesc->nparts; i++)
+ {
+ Oid childOID = partdesc->oids[i];
+ Relation childrel;
+
+ /* Open rel; we already have required locks */
+ childrel = heap_open(childOID, NoLock);
+
+ /* As in expand_inherited_rtentry, skip non-local temp tables */
+ if (RELATION_IS_OTHER_TEMP(childrel))
+ {
+ heap_close(childrel, lockmode);
+ continue;
+ }
+
+ expand_single_inheritance_child(root, parentrte, parentRTindex,
+ parentrel, parentrc, childrel,
+ has_child, appinfos,
+ partitioned_child_rels);
+
+ /* If this child is itself partitioned, recurse */
+ if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+ expand_partitioned_rtentry(root, parentrte, parentRTindex,
+ parentrel, parentrc,
+ RelationGetPartitionDesc(childrel),
+ lockmode,
+ has_child, appinfos,
+ partitioned_child_rels);
+
+ /* Close child relation, but keep locks */
+ heap_close(childrel, NoLock);
+ }
+}
+
/*
* expand_single_inheritance_child
* Expand a single inheritance child, if needed.
diff --git a/src/test/regress/expected/insert.out b/src/test/regress/expected/insert.out
index a2d9469..e159d62 100644
--- a/src/test/regress/expected/insert.out
+++ b/src/test/regress/expected/insert.out
@@ -278,12 +278,12 @@ select tableoid::regclass, * from list_parted;
-------------+----+----
part_aa_bb | aA |
part_cc_dd | cC | 1
- part_null | | 0
- part_null | | 1
part_ee_ff1 | ff | 1
part_ee_ff1 | EE | 1
part_ee_ff2 | ff | 11
part_ee_ff2 | EE | 10
+ part_null | | 0
+ part_null | | 1
(8 rows)
-- some more tests to exercise tuple-routing with multi-level partitioning
--
1.7.9.5
On 2017/08/31 4:45, Robert Haas wrote:
On Wed, Aug 30, 2017 at 12:47 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
+1. I think we should just pull out the OIDs from partition descriptor.
Like this? The first patch refactors the expansion of a single child
out into a separate function, and the second patch implements EIBO on
top of it.
I realized while doing this that we really want to expand the
partitioning hierarchy depth-first, not breadth-first. For some
things, like partition-wise join in the case where all bounds match
exactly, we really only need a *predictable* ordering that will be the
same for two equi-partitioned tables. A breadth-first expansion will
give us that. But it's not actually in bound order. For example:
create table foo (a int, b text) partition by list (a);
create table foo1 partition of foo for values in (2);
create table foo2 partition of foo for values in (1) partition by range (b);
create table foo2a partition of foo2 for values from ('b') to ('c');
create table foo2b partition of foo2 for values from ('a') to ('b');
create table foo3 partition of foo for values in (3);
The correct bound-order expansion of this is foo2b - foo2a - foo1 -
foo3, which is indeed what you get with the attached patch. But if we
did the expansion in breadth-first fashion, we'd get foo1 - foo3 -
foo2a, foo2b, which is, well, not in bound order. If the idea is that
you see a > 2 and rule out all partitions that appear before the first
one with an a-value >= 2, it's not going to work.
I think, overall, this might be a good idea. Thanks for working on it.
The patches I posted in the "path toward faster partition pruning" thread
achieve the same end result as your patch, namely that the leaf partitions
appear in partition bound order in the Append path for a partitioned table.
They achieve that result in a somewhat different way, but let's forget about
that for a moment. One thing the patch on that thread didn't achieve
though is getting the leaf partitions in the same (partition bound) order
in the ModifyTable path for UPDATE/DELETE, because inheritance_planner()
path is not modified in a suitable way (in fact, I'm afraid that there
might be a deadlock bug lurking there, which I must address).
Your patch, OTOH, achieves the same order in both cases, which seems
desirable.
Mind you, that idea has some problems anyway in the face of default
partitions, null partitions, and list partitions which accept
non-contiguous values (e.g. one partition for 1, 3, 5; another for 2,
4, 6). We might need to mark the PartitionDesc to indicate whether
PartitionDesc-order is in fact bound-order in a particular instance,
or something like that.
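As a rough illustration of what such a marker would have to detect, here is a hypothetical sketch (nothing like it is in the attached patches). It assumes each list partition is summarized by its smallest and largest accepted value, that the array is sorted by smallest value, and it ignores default and null partitions entirely. ListPartRange and desc_order_is_bound_order() are made-up names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one list partition's accepted values. */
typedef struct ListPartRange
{
    int         minval;     /* smallest accepted value */
    int         maxval;     /* largest accepted value */
} ListPartRange;

/*
 * Return true if no partition's values interleave with the next one's,
 * i.e. scanning the partitions in array order visits values in
 * ascending order.  Assumes parts[] is sorted by minval.
 */
static bool
desc_order_is_bound_order(const ListPartRange *parts, size_t nparts)
{
    size_t      i;

    for (i = 1; i < nparts; i++)
    {
        if (parts[i - 1].maxval >= parts[i].minval)
            return false;   /* e.g. (1, 3, 5) followed by (2, 4, 6) */
    }
    return true;
}
```

For the example in the quoted text, partitions for (1, 3, 5) and (2, 4, 6) summarize to {1, 5} and {2, 6}, so the check returns false; partitions for (1, 2) and (3, 4) would pass.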
ISTM, the primary motivation for the EIBO patch at this point is to get
the partitions ordered in a predictable manner so that the partition-wise
join patch and update partition key patches could implement certain logic
using O (n) algorithm rather than an O (n^2) one. Neither of them depend
on the actual order in the sense of, say, sticking a PathKey to the
resulting Append. Perhaps, the patch to "Make the optimiser aware of
partitions ordering" [1] will have to consider this somehow; maybe by
limiting its scope to only the cases where the root partitioned table is
range partitioned.
Thanks,
Amit
[1]: https://commitfest.postgresql.org/14/1093/
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Aug 31, 2017 at 3:36 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
ISTM, the primary motivation for the EIBO patch at this point is to get
the partitions ordered in a predictable manner so that the partition-wise
join patch and update partition key patches could implement certain logic
using O (n) algorithm rather than an O (n^2) one.
That's part of it, but not the whole thing. For example, BASIC
partition-wise join only needs a predictable order, not a
bound-ordered one. But the next step is to be able to match up uneven
bounds - e.g. given [1000, 2000), [3000, 4000), [5000, 6000) on one
side and [1100, 2100), [2900,3900), and [5500,5600) on the other side,
we can still make it work. That greatly benefits from being able to
iterate through all the bounds in order.
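A minimal sketch of why bound order helps here, assuming both sides are arrays of half-open ranges sorted by lower bound. This is not the partition-wise join patch's algorithm; Range, Match and match_overlapping_bounds() are hypothetical names used only to show a single linear sweep over both sides.

```c
#include <stddef.h>

/* A half-open range bound [lo, hi). */
typedef struct Range
{
    int         lo;
    int         hi;
} Range;

/* Indexes of one overlapping pair: a[left] overlaps b[right]. */
typedef struct Match
{
    size_t      left;
    size_t      right;
} Match;

/*
 * Walk both bound-ordered arrays once, recording every overlapping pair.
 * "out" needs room for up to na + nb entries.  Runs in O(na + nb) because
 * each step advances one of the two cursors.
 */
static size_t
match_overlapping_bounds(const Range *a, size_t na,
                         const Range *b, size_t nb,
                         Match *out)
{
    size_t      i = 0,
                j = 0,
                n = 0;

    while (i < na && j < nb)
    {
        if (a[i].lo < b[j].hi && b[j].lo < a[i].hi)
        {
            out[n].left = i;
            out[n].right = j;
            n++;
        }

        /* Advance whichever range ends first; it cannot match anything later. */
        if (a[i].hi <= b[j].hi)
            i++;
        else
            j++;
    }
    return n;
}
```

For the bounds above this pairs the first, second and third ranges of each side with one another. Without a sorted order on both sides, finding the same pairs would mean comparing every range with every other one.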
Neither of them depend
on the actual order in the sense of, say, sticking a PathKey to the
resulting Append. Perhaps, the patch to "Make the optimiser aware of
partitions ordering" [1] will have to consider this somehow; maybe by
limiting its scope to only the cases where the root partitioned table is
range partitioned.
I think that doing a depth-first traversal as I've done here avoids
the need to limit it to that case. If we did a breadth-first
traversal anything that was subpartitioned would end up having the
subpartitions at the end instead of in the sequence, but the
depth-first traversal avoids that issue.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Aug 31, 2017 at 2:56 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
Here are the patches revised a bit. I have esp changed the variable
names and arguments to reflect their true role in the functions. Also
updated prologue of expand_single_inheritance_child() to mention
"has_child". Let me know if those changes look good.
Sure. Committed as you have it.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 31 August 2017 at 13:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Mind you, that idea has some problems anyway in the face of default
partitions, null partitions, and list partitions which accept
non-contiguous values (e.g. one partition for 1, 3, 5; another for 2,
4, 6). We might need to mark the PartitionDesc to indicate whether
PartitionDesc-order is in fact bound-order in a particular instance,
or something like that.
ISTM, the primary motivation for the EIBO patch at this point is to get
the partitions ordered in a predictable manner so that the partition-wise
join patch and update partition key patches could implement certain logic
using O (n) algorithm rather than an O (n^2) one. Neither of them depend
on the actual order in the sense of, say, sticking a PathKey to the
resulting Append.
Now that the inheritance hierarchy is expanded in depth-first order,
RelationGetPartitionDispatchInfo() needs to be changed to arrange the
PartitionDispatch array and the leaf partitions in depth-first order
(as we know this is a requirement for the update-partition-key patch
for efficiently determining which of the leaf partitions are already
present in the update result rels). Amit, I am not sure if you are
already doing this as part of the patches in this mail thread. Please
let me know. Also, let me know if you think there will be any loss of
efficiency in tuple routing code if we arrange the Partition Dispatch
indexes in depth-first order.
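As a rough illustration of why a shared canonical order matters here, the following standalone sketch marks which leaf partitions already appear among the UPDATE result rels in a single pass. It is only a sketch under assumptions: the Oid typedef is a stand-in, mark_existing_result_rels() is a hypothetical helper, and it assumes both arrays list partitions in the same order with the result rels being a subset of the leaves.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int Oid;       /* stand-in for PostgreSQL's Oid */

/*
 * Hypothetical helper: for each leaf partition, record whether it also
 * appears in the result-rel list.  Because both arrays are assumed to be in
 * the same canonical order, one cursor into result_rels is enough and the
 * whole pass is O(nleaves) instead of O(nleaves * nresult).
 *
 * Returns the number of result rels that were matched.
 */
static int
mark_existing_result_rels(const Oid *leaves, int nleaves,
                          const Oid *result_rels, int nresult,
                          bool *present)
{
    int         i;
    int         j = 0;          /* next unmatched entry in result_rels */

    for (i = 0; i < nleaves; i++)
    {
        if (j < nresult && leaves[i] == result_rels[j])
        {
            present[i] = true;
            j++;
        }
        else
            present[i] = false;
    }
    return j;
}
```

If the two lists were in different orders, each leaf would have to be searched for in the whole result-rel list, which is where the O(n^2) behaviour would come from.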
Hi Amit,
On 2017/09/03 16:07, Amit Khandekar wrote:
On 31 August 2017 at 13:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Mind you, that idea has some problems anyway in the face of default
partitions, null partitions, and list partitions which accept
non-contiguous values (e.g. one partition for 1, 3, 5; another for 2,
4, 6). We might need to mark the PartitionDesc to indicate whether
PartitionDesc-order is in fact bound-order in a particular instance,
or something like that.
ISTM, the primary motivation for the EIBO patch at this point is to get
the partitions ordered in a predictable manner so that the partition-wise
join patch and update partition key patches could implement certain logic
using an O(n) algorithm rather than an O(n^2) one. Neither of them depends
on the actual order in the sense of, say, sticking a PathKey to the
resulting Append.
Now that the inheritance hierarchy is expanded in depth-first order,
RelationGetPartitionDispatchInfo() needs to be changed to arrange the
PartitionDispatch array and the leaf partitions in depth-first order
(as we know this is a requirement for the update-partition-key patch
for efficiently determining which of the leaf partitions are already
present in the update result rels).
I was thinking the same.
Amit, I am not sure if you are
already doing this as part of the patches in this mail thread. Please
let me know.
Actually, I had thought of changing the expansion order in
RelationGetPartitionDispatchInfo to depth-first after Robert committed his
patch the other day, but haven't got around to doing that yet. Will do
that in the updated patch (the refactoring patch) I will post sometime
later today or tomorrow on a differently titled thread, because the EIBO
work seems to be done.
Also, let me know if you think there will be any loss of
efficiency in tuple routing code if we arrange the Partition Dispatch
indexes in depth-first order.
I don't think there will be any loss in the efficiency of the tuple
routing code itself. It's just that the position of the ResultRelInfos
(of leaf partitions) and PartitionDispatch objects (of partitioned tables)
will be different in their respective arrays, that is, pd->indexes will
now have different values than formerly.
And now, because the planner will put the leaf partitions' subplans / WCOs /
RETURNING projections in that order in the ModifyTable node, we must make
sure that we adopt the same order in the executor, as you already noted.
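To make the point about pd->indexes concrete, here is a small standalone sketch of the encoding used by the routing loop: a non-negative entry is a leaf index, a negative entry is the negated index of another partitioned table in the dispatch array. ToyDispatch, toy_route() and the two layouts are hypothetical and only mimic the convention; the partition chosen at each level is passed in rather than computed from bounds.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for one PartitionDispatch entry. */
typedef struct ToyDispatch
{
    int         nparts;
    int         indexes[4];     /* >= 0: leaf index; < 0: negated index into
                                 * the array of partitioned tables */
} ToyDispatch;

/* Toy tree: root -> p1, p2 (-> p2a, p2b), p3; pds[0] = root, pds[1] = p2. */

/* Breadth-first layout: leaves numbered p1, p3, p2a, p2b. */
static const ToyDispatch bfs_layout[] = {
    {3, {0, -1, 1}},
    {2, {2, 3}},
};

/* Depth-first layout: leaves numbered p1, p2a, p2b, p3. */
static const ToyDispatch dfs_layout[] = {
    {3, {0, -1, 3}},
    {2, {1, 2}},
};

/*
 * Follow 'choices' (the partition picked at each level) down from the root
 * and return the resulting leaf index.  The loop itself does not depend on
 * how the leaves were numbered.
 */
static int
toy_route(const ToyDispatch *pds, const int *choices)
{
    const ToyDispatch *parent = &pds[0];
    int         level = 0;

    for (;;)
    {
        int         idx = parent->indexes[choices[level++]];

        if (idx >= 0)
            return idx;         /* reached a leaf partition */
        parent = &pds[-idx];    /* descend into a partitioned table */
    }
}
```

The same loop routes correctly with either layout; only the values stored in the indexes arrays (and hence the positions in the leaf array) differ.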
Thanks,
Amit
On 4 September 2017 at 06:34, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
Hi Amit,
On 2017/09/03 16:07, Amit Khandekar wrote:
On 31 August 2017 at 13:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Mind you, that idea has some problems anyway in the face of default
partitions, null partitions, and list partitions which accept
non-contiguous values (e.g. one partition for 1, 3, 5; another for 2,
4, 6). We might need to mark the PartitionDesc to indicate whether
PartitionDesc-order is in fact bound-order in a particular instance,
or something like that.
ISTM, the primary motivation for the EIBO patch at this point is to get
the partitions ordered in a predictable manner so that the partition-wise
join patch and update partition key patches could implement certain logic
using an O(n) algorithm rather than an O(n^2) one. Neither of them depends
on the actual order in the sense of, say, sticking a PathKey to the
resulting Append.
Now that the inheritance hierarchy is expanded in depth-first order,
RelationGetPartitionDispatchInfo() needs to be changed to arrange the
PartitionDispatch array and the leaf partitions in depth-first order
(as we know this is a requirement for the update-partition-key patch
for efficiently determining which of the leaf partitions are already
present in the update result rels).
I was thinking the same.
Amit, I am not sure if you are
already doing this as part of the patches in this mail thread. Please
let me know.
Actually, I had thought of changing the expansion order in
RelationGetPartitionDispatchInfo to depth-first after Robert committed his
patch the other day, but haven't got around to doing that yet. Will do
that in the updated patch (the refactoring patch) I will post sometime
later today or tomorrow on a differently titled thread, because the EIBO
work seems to be done.
Great, thanks. Just wanted to make sure someone is working on that,
because, as you said, it is no longer an EIBO patch. Since you are
doing that, I won't work on that.
Also, let me know if you think there will be any loss of
efficiency in tuple routing code if we arrange the Partition Dispatch
indexes in depth-first order.
I don't think there will be any loss in the efficiency of the tuple
routing code itself. It's just that the position of the ResultRelInfos
(of leaf partitions) and PartitionDispatch objects (of partitioned tables)
will be different in their respective arrays, that is, pd->indexes will
now have different values than formerly.
Ok. Good to hear that.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On 2017/09/05 14:11, Amit Khandekar wrote:
Great, thanks. Just wanted to make sure someone is working on that,
because, as you said, it is no longer an EIBO patch. Since you are
doing that, I won't work on that.
Here is that patch (actually two patches). Sorry it took me a bit.
Description:
[PATCH 1/2] Decouple RelationGetPartitionDispatchInfo() from executor
Currently, this function and the structures it generates, viz.
PartitionDispatch objects, are too tightly coupled with the executor's
tuple-routing code. In particular, it is undesirable that the caller is
made responsible for releasing some resources, such as executor tuple
table slots and tuple-conversion maps. After this refactoring,
ExecSetupPartitionTupleRouting() does some of the work that was
previously done in RelationGetPartitionDispatchInfo().
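The ownership split can be pictured with the following standalone sketch. It is a toy model only: ToyDispatch, ToyRoutingInfo and both functions are hypothetical, malloc() stands in for creating a tuple table slot, and 0 plays the role of InvalidOid. The idea is that the catalog-level object carries plain data, while the executor-level wrapper creates and later releases the routing resources.

```c
#include <assert.h>
#include <stdlib.h>

/* Catalog-level description: plain data, no executor resources. */
typedef struct ToyDispatch
{
    unsigned int relid;
    unsigned int parentoid;     /* 0 stands in for InvalidOid (root table) */
} ToyDispatch;

/* Executor-level wrapper that owns the per-table routing resources. */
typedef struct ToyRoutingInfo
{
    const ToyDispatch *pd;      /* borrowed from the catalog-level array */
    void       *tupslot;        /* owned here; stays NULL for the root */
} ToyRoutingInfo;

/* Executor side: wrap each dispatch entry and allocate what routing needs. */
static ToyRoutingInfo *
toy_setup_routing(const ToyDispatch *pds, int num_parted)
{
    ToyRoutingInfo *infos = calloc(num_parted, sizeof(ToyRoutingInfo));
    int         i;

    if (infos == NULL)
        return NULL;
    for (i = 0; i < num_parted; i++)
    {
        infos[i].pd = &pds[i];
        if (pds[i].parentoid != 0)
            infos[i].tupslot = malloc(1);   /* placeholder for a real slot */
    }
    return infos;
}

/*
 * Executor side: release what toy_setup_routing() allocated.  Entry 0 is
 * the root, which has no slot.  Returns the number of non-root entries
 * that were cleaned up.
 */
static int
toy_end_routing(ToyRoutingInfo *infos, int num_parted)
{
    int         i;
    int         released = 0;

    for (i = 1; i < num_parted; i++)
    {
        free(infos[i].tupslot);
        released++;
    }
    free(infos);
    return released;
}
```

With this split, a caller that only needs the catalog-level array has nothing to release, which is roughly the property the refactoring is after.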
[PATCH 2/2] Make RelationGetPartitionDispatch expansion order
depth-first
This makes the order match what the planner does when expanding
partitioning inheritance. Matching the planner's order makes it easier
to pair the executor's per-partition objects with the planner-created
per-partition nodes.
Actually, I'm coming to a conclusion that we should keep any
whole-partition-tree stuff out of partition.c and its interface, as Robert
has also alluded to in an earlier message on this thread [1]. But since
that's a different topic, I'll shut up about it on this thread and start a
new thread to discuss what kind of code rearrangement is possible.
Thanks,
Amit
[1]: /messages/by-id/CA+Tgmoafr=hUrM=cbx-k=BDHOF2OfXaw95HQSNAK4mHBwmSjtw@mail.gmail.com
Attachments:
0001-Decouple-RelationGetPartitionDispatchInfo-from-execu.patch (text/plain; charset=UTF-8)
From 6956ac321df169f6c26c383ddcb5ea48c1a0071b Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Wed, 30 Aug 2017 10:02:05 +0900
Subject: [PATCH 1/3] Decouple RelationGetPartitionDispatchInfo() from executor
Currently it and the structure it generates viz. PartitionDispatch
objects are too coupled with the executor's tuple-routing code. In
particular, it's pretty undesirable that it makes it the responsibility
of the caller to release some resources, such as executor tuple table
slots, tuple-conversion maps, etc. That makes it harder to use in
places other than where it's currently being used.
After this refactoring, ExecSetupPartitionTupleRouting() now needs to
do some of the work that was previously done in
RelationGetPartitionDispatchInfo().
---
src/backend/catalog/partition.c | 53 +++++++-------------
src/backend/commands/copy.c | 32 +++++++------
src/backend/executor/execMain.c | 88 ++++++++++++++++++++++++++++++----
src/backend/executor/nodeModifyTable.c | 37 +++++++-------
src/include/catalog/partition.h | 20 +++-----
src/include/executor/executor.h | 4 +-
src/include/nodes/execnodes.h | 40 +++++++++++++++-
7 files changed, 181 insertions(+), 93 deletions(-)
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index c6bd02f77d..4f594243d3 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1061,7 +1061,6 @@ RelationGetPartitionDispatchInfo(Relation rel,
Relation partrel = lfirst(lc1);
Relation parent = lfirst(lc2);
PartitionKey partkey = RelationGetPartitionKey(partrel);
- TupleDesc tupdesc = RelationGetDescr(partrel);
PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
int j,
m;
@@ -1069,29 +1068,12 @@ RelationGetPartitionDispatchInfo(Relation rel,
pd[i] = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
pd[i]->reldesc = partrel;
pd[i]->key = partkey;
- pd[i]->keystate = NIL;
pd[i]->partdesc = partdesc;
if (parent != NULL)
- {
- /*
- * For every partitioned table other than root, we must store a
- * tuple table slot initialized with its tuple descriptor and a
- * tuple conversion map to convert a tuple from its parent's
- * rowtype to its own. That is to make sure that we are looking at
- * the correct row using the correct tuple descriptor when
- * computing its partition key for tuple routing.
- */
- pd[i]->tupslot = MakeSingleTupleTableSlot(tupdesc);
- pd[i]->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
- tupdesc,
- gettext_noop("could not convert row type"));
- }
+ pd[i]->parentoid = RelationGetRelid(parent);
else
- {
- /* Not required for the root partitioned table */
- pd[i]->tupslot = NULL;
- pd[i]->tupmap = NULL;
- }
+ pd[i]->parentoid = InvalidOid;
+
pd[i]->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
/*
@@ -1840,7 +1822,7 @@ generate_partition_qual(Relation rel)
* Construct values[] and isnull[] arrays for the partition key
* of a tuple.
*
- * pd Partition dispatch object of the partitioned table
+ * ptrinfo PartitionTupleRoutingInfo object of the table
* slot Heap tuple from which to extract partition key
* estate executor state for evaluating any partition key
* expressions (must be non-NULL)
@@ -1852,26 +1834,27 @@ generate_partition_qual(Relation rel)
* ----------------
*/
void
-FormPartitionKeyDatum(PartitionDispatch pd,
+FormPartitionKeyDatum(PartitionTupleRoutingInfo *ptrinfo,
TupleTableSlot *slot,
EState *estate,
Datum *values,
bool *isnull)
{
+ PartitionDispatch pd = ptrinfo->pd;
ListCell *partexpr_item;
int i;
- if (pd->key->partexprs != NIL && pd->keystate == NIL)
+ if (pd->key->partexprs != NIL && ptrinfo->keystate == NIL)
{
/* Check caller has set up context correctly */
Assert(estate != NULL &&
GetPerTupleExprContext(estate)->ecxt_scantuple == slot);
/* First time through, set up expression evaluation state */
- pd->keystate = ExecPrepareExprList(pd->key->partexprs, estate);
+ ptrinfo->keystate = ExecPrepareExprList(pd->key->partexprs, estate);
}
- partexpr_item = list_head(pd->keystate);
+ partexpr_item = list_head(ptrinfo->keystate);
for (i = 0; i < pd->key->partnatts; i++)
{
AttrNumber keycol = pd->key->partattrs[i];
@@ -1911,13 +1894,13 @@ FormPartitionKeyDatum(PartitionDispatch pd,
* the latter case.
*/
int
-get_partition_for_tuple(PartitionDispatch *pd,
+get_partition_for_tuple(PartitionTupleRoutingInfo **ptrinfos,
TupleTableSlot *slot,
EState *estate,
- PartitionDispatchData **failed_at,
+ PartitionTupleRoutingInfo **failed_at,
TupleTableSlot **failed_slot)
{
- PartitionDispatch parent;
+ PartitionTupleRoutingInfo *parent;
Datum values[PARTITION_MAX_KEYS];
bool isnull[PARTITION_MAX_KEYS];
int result;
@@ -1925,11 +1908,11 @@ get_partition_for_tuple(PartitionDispatch *pd,
TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
/* start with the root partitioned table */
- parent = pd[0];
+ parent = ptrinfos[0];
while (true)
{
- PartitionKey key = parent->key;
- PartitionDesc partdesc = parent->partdesc;
+ PartitionKey key = parent->pd->key;
+ PartitionDesc partdesc = parent->pd->partdesc;
TupleTableSlot *myslot = parent->tupslot;
TupleConversionMap *map = parent->tupmap;
int cur_index = -1;
@@ -2039,13 +2022,13 @@ get_partition_for_tuple(PartitionDispatch *pd,
*failed_slot = slot;
break;
}
- else if (parent->indexes[cur_index] >= 0)
+ else if (parent->pd->indexes[cur_index] >= 0)
{
- result = parent->indexes[cur_index];
+ result = parent->pd->indexes[cur_index];
break;
}
else
- parent = pd[-parent->indexes[cur_index]];
+ parent = ptrinfos[-parent->pd->indexes[cur_index]];
}
error_exit:
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfa3f059c2..288d6a1ab2 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -165,8 +165,8 @@ typedef struct CopyStateData
bool volatile_defexprs; /* is any of defexprs volatile? */
List *range_table;
- PartitionDispatch *partition_dispatch_info;
- int num_dispatch; /* Number of entries in the above array */
+ PartitionTupleRoutingInfo **ptrinfos;
+ int num_parted; /* Number of entries in the above array */
int num_partitions; /* Number of members in the following arrays */
ResultRelInfo *partitions; /* Per partition result relation */
TupleConversionMap **partition_tupconv_maps;
@@ -2445,7 +2445,7 @@ CopyFrom(CopyState cstate)
*/
if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- PartitionDispatch *partition_dispatch_info;
+ PartitionTupleRoutingInfo **ptrinfos;
ResultRelInfo *partitions;
TupleConversionMap **partition_tupconv_maps;
TupleTableSlot *partition_tuple_slot;
@@ -2455,13 +2455,13 @@ CopyFrom(CopyState cstate)
ExecSetupPartitionTupleRouting(cstate->rel,
1,
estate,
- &partition_dispatch_info,
+ &ptrinfos,
&partitions,
&partition_tupconv_maps,
&partition_tuple_slot,
&num_parted, &num_partitions);
- cstate->partition_dispatch_info = partition_dispatch_info;
- cstate->num_dispatch = num_parted;
+ cstate->ptrinfos = ptrinfos;
+ cstate->num_parted = num_parted;
cstate->partitions = partitions;
cstate->num_partitions = num_partitions;
cstate->partition_tupconv_maps = partition_tupconv_maps;
@@ -2502,7 +2502,7 @@ CopyFrom(CopyState cstate)
if ((resultRelInfo->ri_TrigDesc != NULL &&
(resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
- cstate->partition_dispatch_info != NULL ||
+ cstate->ptrinfos != NULL ||
cstate->volatile_defexprs)
{
useHeapMultiInsert = false;
@@ -2580,7 +2580,7 @@ CopyFrom(CopyState cstate)
ExecStoreTuple(tuple, slot, InvalidBuffer, false);
/* Determine the partition to heap_insert the tuple into */
- if (cstate->partition_dispatch_info)
+ if (cstate->ptrinfos)
{
int leaf_part_index;
TupleConversionMap *map;
@@ -2594,7 +2594,7 @@ CopyFrom(CopyState cstate)
* partition, respectively.
*/
leaf_part_index = ExecFindPartition(resultRelInfo,
- cstate->partition_dispatch_info,
+ cstate->ptrinfos,
slot,
estate);
Assert(leaf_part_index >= 0 &&
@@ -2826,8 +2826,8 @@ CopyFrom(CopyState cstate)
ExecCloseIndices(resultRelInfo);
- /* Close all the partitioned tables, leaf partitions, and their indices */
- if (cstate->partition_dispatch_info)
+ /* Release some resources that we acquired for tuple-routing. */
+ if (cstate->ptrinfos)
{
int i;
@@ -2837,13 +2837,15 @@ CopyFrom(CopyState cstate)
* the main target table of COPY that will be closed eventually by
* DoCopy(). Also, tupslot is NULL for the root partitioned table.
*/
- for (i = 1; i < cstate->num_dispatch; i++)
+ for (i = 1; i < cstate->num_parted; i++)
{
- PartitionDispatch pd = cstate->partition_dispatch_info[i];
+ PartitionTupleRoutingInfo *ptrinfo = cstate->ptrinfos[i];
- heap_close(pd->reldesc, NoLock);
- ExecDropSingleTupleTableSlot(pd->tupslot);
+ heap_close(ptrinfo->pd->reldesc, NoLock);
+ ExecDropSingleTupleTableSlot(ptrinfo->tupslot);
}
+
+ /* Close all the leaf partitions and their indices */
for (i = 0; i < cstate->num_partitions; i++)
{
ResultRelInfo *resultRelInfo = cstate->partitions + i;
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 4b594d489c..fe186abe69 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -3243,8 +3243,8 @@ EvalPlanQualEnd(EPQState *epqstate)
* tuple routing for partitioned tables
*
* Output arguments:
- * 'pd' receives an array of PartitionDispatch objects with one entry for
- * every partitioned table in the partition tree
+ * 'ptrinfos' receives an array of PartitionTupleRoutingInfo objects with one
+ * entry for each partitioned table in the partition tree
* 'partitions' receives an array of ResultRelInfo objects with one entry for
* every leaf partition in the partition tree
* 'tup_conv_maps' receives an array of TupleConversionMap objects with one
@@ -3267,7 +3267,7 @@ void
ExecSetupPartitionTupleRouting(Relation rel,
Index resultRTindex,
EState *estate,
- PartitionDispatch **pd,
+ PartitionTupleRoutingInfo ***ptrinfos,
ResultRelInfo **partitions,
TupleConversionMap ***tup_conv_maps,
TupleTableSlot **partition_tuple_slot,
@@ -3275,16 +3275,84 @@ ExecSetupPartitionTupleRouting(Relation rel,
{
TupleDesc tupDesc = RelationGetDescr(rel);
List *leaf_parts;
+ PartitionDispatch *pds;
ListCell *cell;
int i;
ResultRelInfo *leaf_part_rri;
+ Relation parent;
/*
* Get the information about the partition tree after locking all the
* partitions.
*/
(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
- *pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
+ pds = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
+
+ /*
+ * Construct PartitionTupleRoutingInfo objects, one for each partitioned
+ * table in the tree, using its PartitionDispatch in the pds array.
+ */
+ *ptrinfos = (PartitionTupleRoutingInfo **) palloc0(*num_parted *
+ sizeof(PartitionTupleRoutingInfo *));
+ parent = NULL;
+ for (i = 0; i < *num_parted; i++)
+ {
+ PartitionTupleRoutingInfo *ptrinfo;
+
+ ptrinfo = (PartitionTupleRoutingInfo *)
+ palloc0(sizeof(PartitionTupleRoutingInfo));
+ /* Stash a reference to this PartitionDispatch. */
+ ptrinfo->pd = pds[i];
+
+ /* State for extracting partition key from tuples will go here. */
+ ptrinfo->keystate = NIL;
+
+ /*
+ * For every partitioned table other than root, we must store a tuple
+ * table slot initialized with its tuple descriptor and a tuple
+ * conversion map to convert a tuple from its parent's rowtype to its
+ * own. That is to make sure that we are looking at the correct row
+ * using the correct tuple descriptor when computing its partition key
+ * for tuple routing.
+ */
+ if (pds[i]->parentoid != InvalidOid)
+ {
+ TupleDesc tupdesc = RelationGetDescr(pds[i]->reldesc);
+
+ /* Open the parent relation descriptor if not already done. */
+ if (pds[i]->parentoid == RelationGetRelid(rel))
+ parent = rel;
+ else if (parent == NULL)
+ /* Locked by RelationGetPartitionDispatchInfo(). */
+ parent = heap_open(pds[i]->parentoid, NoLock);
+
+ ptrinfo->tupslot = MakeSingleTupleTableSlot(tupdesc);
+ ptrinfo->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
+ tupdesc,
+ gettext_noop("could not convert row type"));
+ /*
+ * Close the parent descriptor, if the next partitioned table in
+ * the list is not a sibling, because it will have a different
+ * parent if so.
+ */
+ if (parent != NULL && parent != rel && i + 1 < *num_parted &&
+ pds[i + 1]->parentoid != pds[i]->parentoid)
+ {
+ heap_close(parent, NoLock);
+ parent = NULL;
+ }
+ }
+ else
+ {
+ /* Not required for the root partitioned table */
+ ptrinfo->tupslot = NULL;
+ ptrinfo->tupmap = NULL;
+ }
+
+ (*ptrinfos)[i] = ptrinfo;
+ }
+
+ /* For leaf partitions, we build ResultRelInfos and TupleConversionMaps. */
*num_partitions = list_length(leaf_parts);
*partitions = (ResultRelInfo *) palloc(*num_partitions *
sizeof(ResultRelInfo));
@@ -3361,11 +3429,13 @@ ExecSetupPartitionTupleRouting(Relation rel,
* by get_partition_for_tuple() unchanged.
*/
int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
- TupleTableSlot *slot, EState *estate)
+ExecFindPartition(ResultRelInfo *resultRelInfo,
+ PartitionTupleRoutingInfo **ptrinfos,
+ TupleTableSlot *slot,
+ EState *estate)
{
int result;
- PartitionDispatchData *failed_at;
+ PartitionTupleRoutingInfo *failed_at;
TupleTableSlot *failed_slot;
/*
@@ -3375,7 +3445,7 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
if (resultRelInfo->ri_PartitionCheck)
ExecPartitionCheck(resultRelInfo, slot, estate);
- result = get_partition_for_tuple(pd, slot, estate,
+ result = get_partition_for_tuple(ptrinfos, slot, estate,
&failed_at, &failed_slot);
if (result < 0)
{
@@ -3385,7 +3455,7 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
char *val_desc;
ExprContext *ecxt = GetPerTupleExprContext(estate);
- failed_rel = failed_at->reldesc;
+ failed_rel = failed_at->pd->reldesc;
ecxt->ecxt_scantuple = failed_slot;
FormPartitionKeyDatum(failed_at, failed_slot, estate,
key_values, key_isnull);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index bd84778739..06c69c5783 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -278,7 +278,7 @@ ExecInsert(ModifyTableState *mtstate,
resultRelInfo = estate->es_result_relation_info;
/* Determine the partition to heap_insert the tuple into */
- if (mtstate->mt_partition_dispatch_info)
+ if (mtstate->mt_ptrinfos)
{
int leaf_part_index;
TupleConversionMap *map;
@@ -292,7 +292,7 @@ ExecInsert(ModifyTableState *mtstate,
* respectively.
*/
leaf_part_index = ExecFindPartition(resultRelInfo,
- mtstate->mt_partition_dispatch_info,
+ mtstate->mt_ptrinfos,
slot,
estate);
Assert(leaf_part_index >= 0 &&
@@ -1487,7 +1487,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
int numResultRelInfos;
/* Find the set of partitions so that we can find their TupleDescs. */
- if (mtstate->mt_partition_dispatch_info != NULL)
+ if (mtstate->mt_ptrinfos != NULL)
{
/*
* For INSERT via partitioned table, so we need TupleDescs based
@@ -1911,7 +1911,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
if (operation == CMD_INSERT &&
rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- PartitionDispatch *partition_dispatch_info;
+ PartitionTupleRoutingInfo **ptrinfos;
ResultRelInfo *partitions;
TupleConversionMap **partition_tupconv_maps;
TupleTableSlot *partition_tuple_slot;
@@ -1921,13 +1921,13 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
ExecSetupPartitionTupleRouting(rel,
node->nominalRelation,
estate,
- &partition_dispatch_info,
+ &ptrinfos,
&partitions,
&partition_tupconv_maps,
&partition_tuple_slot,
&num_parted, &num_partitions);
- mtstate->mt_partition_dispatch_info = partition_dispatch_info;
- mtstate->mt_num_dispatch = num_parted;
+ mtstate->mt_ptrinfos = ptrinfos;
+ mtstate->mt_num_parted = num_parted;
mtstate->mt_partitions = partitions;
mtstate->mt_num_partitions = num_partitions;
mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
@@ -2336,21 +2336,24 @@ ExecEndModifyTable(ModifyTableState *node)
resultRelInfo);
}
+ /* Release some resources that we acquired for tuple-routing. */
+
/*
- * Close all the partitioned tables, leaf partitions, and their indices
- *
- * Remember node->mt_partition_dispatch_info[0] corresponds to the root
- * partitioned table, which we must not try to close, because it is the
- * main target table of the query that will be closed by ExecEndPlan().
- * Also, tupslot is NULL for the root partitioned table.
+ * node->mt_ptrinfos[0] corresponds to the root partitioned table, for
+ * which we didn't create tupslot. Also, its relation descriptor will
+ * be closed in ExecEndPlan().
*/
- for (i = 1; i < node->mt_num_dispatch; i++)
+ for (i = 1; i < node->mt_num_parted; i++)
{
- PartitionDispatch pd = node->mt_partition_dispatch_info[i];
+ PartitionTupleRoutingInfo *ptrinfo = node->mt_ptrinfos[i];
- heap_close(pd->reldesc, NoLock);
- ExecDropSingleTupleTableSlot(pd->tupslot);
+ heap_close(ptrinfo->pd->reldesc, NoLock);
+ ExecDropSingleTupleTableSlot(ptrinfo->tupslot);
}
+
+ /*
+ * Close all the leaf partitions and their indices.
+ */
for (i = 0; i < node->mt_num_partitions; i++)
{
ResultRelInfo *resultRelInfo = node->mt_partitions + i;
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 2283c675e9..1091dd572c 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -45,13 +45,8 @@ typedef struct PartitionDescData *PartitionDesc;
*
* reldesc Relation descriptor of the table
* key Partition key information of the table
- * keystate Execution state required for expressions in the partition key
* partdesc Partition descriptor of the table
- * tupslot A standalone TupleTableSlot initialized with this table's tuple
- * descriptor
- * tupmap TupleConversionMap to convert from the parent's rowtype to
- * this table's rowtype (when extracting the partition key of a
- * tuple just before routing it through this table)
+ * parentoid OID of the parent table (InvalidOid if root partitioned table)
* indexes Array with partdesc->nparts members (for details on what
* individual members represent, see how they are set in
* RelationGetPartitionDispatchInfo())
@@ -61,10 +56,8 @@ typedef struct PartitionDispatchData
{
Relation reldesc;
PartitionKey key;
- List *keystate; /* list of ExprState */
PartitionDesc partdesc;
- TupleTableSlot *tupslot;
- TupleConversionMap *tupmap;
+ Oid parentoid;
int *indexes;
} PartitionDispatchData;
@@ -86,17 +79,18 @@ extern List *map_partition_varattnos(List *expr, int target_varno,
extern List *RelationGetPartitionQual(Relation rel);
extern Expr *get_partition_qual_relid(Oid relid);
-/* For tuple routing */
extern PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
int *num_parted, List **leaf_part_oids);
-extern void FormPartitionKeyDatum(PartitionDispatch pd,
+
+/* For tuple routing */
+extern void FormPartitionKeyDatum(PartitionTupleRoutingInfo *ptrinfo,
TupleTableSlot *slot,
EState *estate,
Datum *values,
bool *isnull);
-extern int get_partition_for_tuple(PartitionDispatch *pd,
+extern int get_partition_for_tuple(PartitionTupleRoutingInfo **ptrinfos,
TupleTableSlot *slot,
EState *estate,
- PartitionDispatchData **failed_at,
+ PartitionTupleRoutingInfo **failed_at,
TupleTableSlot **failed_slot);
#endif /* PARTITION_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 770881849c..aee7a41b31 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -209,13 +209,13 @@ extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti);
extern void ExecSetupPartitionTupleRouting(Relation rel,
Index resultRTindex,
EState *estate,
- PartitionDispatch **pd,
+ PartitionTupleRoutingInfo ***ptrinfos,
ResultRelInfo **partitions,
TupleConversionMap ***tup_conv_maps,
TupleTableSlot **partition_tuple_slot,
int *num_parted, int *num_partitions);
extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
- PartitionDispatch *pd,
+ PartitionTupleRoutingInfo **ptrinfos,
TupleTableSlot *slot,
EState *estate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 90a60abc4d..c554a1b311 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -414,6 +414,42 @@ typedef struct ResultRelInfo
Relation ri_PartitionRoot;
} ResultRelInfo;
+/* Forward declarations, to avoid including other headers */
+typedef struct PartitionDispatchData *PartitionDispatch;
+
+/*
+ * PartitionTupleRoutingInfo - information required for tuple-routing
+ * through one partitioned table in a partition
+ * tree
+ */
+typedef struct PartitionTupleRoutingInfo
+{
+
+ /* Information about the table's partitions */
+ PartitionDispatch pd;
+
+ /*
+ * The execution state required for expressions contained in the partition
+ * key. It is NIL until initialized by FormPartitionKeyDatum() if and when
+ * it is called; for example, the first time a tuple is routed through this
+ * table.
+ */
+ List *keystate;
+
+ /*
+ * A standalone TupleTableSlot initialized with this table's tuple
+ * descriptor
+ */
+ TupleTableSlot *tupslot;
+
+ /*
+ * TupleConversionMap to convert from the parent's rowtype to this table's
+ * rowtype (when extracting the partition key of a tuple just before
+ * routing it through this table)
+ */
+ TupleConversionMap *tupmap;
+} PartitionTupleRoutingInfo;
+
/* ----------------
* EState information
*
@@ -973,9 +1009,9 @@ typedef struct ModifyTableState
TupleTableSlot *mt_existing; /* slot to store existing target tuple in */
List *mt_excludedtlist; /* the excluded pseudo relation's tlist */
TupleTableSlot *mt_conflproj; /* CONFLICT ... SET ... projection target */
- struct PartitionDispatchData **mt_partition_dispatch_info;
/* Tuple-routing support info */
- int mt_num_dispatch; /* Number of entries in the above array */
+ struct PartitionTupleRoutingInfo **mt_ptrinfos;
+ int mt_num_parted; /* Number of entries in the above array */
int mt_num_partitions; /* Number of members in the following
* arrays */
ResultRelInfo *mt_partitions; /* Per partition result relation */
--
2.11.0
0002-Make-RelationGetPartitionDispatch-expansion-order-de.patch (text/plain; charset=UTF-8)
From f95b1cb33159620175e012a471513f9a32d64c43 Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Fri, 8 Sep 2017 17:35:10 +0900
Subject: [PATCH 2/3] Make RelationGetPartitionDispatch expansion order
depth-first
This is so as it matches what the planner is doing with partitioning
inheritance expansion. Matching with planner order helps because
it helps ease matching the executor's per-partition objects with
planner-created per-partition nodes.
---
src/backend/catalog/partition.c | 191 ++++++++++++++++------------------------
1 file changed, 74 insertions(+), 117 deletions(-)
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 4f594243d3..a4abe08088 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -139,6 +139,8 @@ static int32 partition_bound_cmp(PartitionKey key,
static int partition_bound_bsearch(PartitionKey key,
PartitionBoundInfo boundinfo,
void *probe, bool probe_is_bound, bool *is_equal);
+static void get_partition_dispatch_recurse(Relation rel, Relation parent,
+ List **pds, List **leaf_part_oids);
/*
* RelationBuildPartitionDesc
@@ -961,21 +963,6 @@ get_partition_qual_relid(Oid relid)
}
/*
- * Append OIDs of rel's partitions to the list 'partoids' and for each OID,
- * append pointer rel to the list 'parents'.
- */
-#define APPEND_REL_PARTITION_OIDS(rel, partoids, parents) \
- do\
- {\
- int i;\
- for (i = 0; i < (rel)->rd_partdesc->nparts; i++)\
- {\
- (partoids) = lappend_oid((partoids), (rel)->rd_partdesc->oids[i]);\
- (parents) = lappend((parents), (rel));\
- }\
- } while(0)
-
-/*
* RelationGetPartitionDispatchInfo
* Returns information necessary to route tuples down a partition tree
*
@@ -991,16 +978,47 @@ PartitionDispatch *
RelationGetPartitionDispatchInfo(Relation rel,
int *num_parted, List **leaf_part_oids)
{
+ List *pdlist;
PartitionDispatchData **pd;
- List *all_parts = NIL,
- *all_parents = NIL,
- *parted_rels,
- *parted_rel_parents;
- ListCell *lc1,
- *lc2;
- int i,
- k,
- offset;
+ ListCell *lc;
+ int i;
+
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+
+ *num_parted = 0;
+ *leaf_part_oids = NIL;
+
+ get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
+ *num_parted = list_length(pdlist);
+ pd = (PartitionDispatchData **) palloc(*num_parted *
+ sizeof(PartitionDispatchData *));
+ i = 0;
+ foreach (lc, pdlist)
+ {
+ pd[i++] = lfirst(lc);
+ }
+
+ return pd;
+}
+
+/*
+ * get_partition_dispatch_recurse
+ * Recursively expand partition tree rooted at rel
+ *
+ * As the partition tree is expanded in a depth-first manner, we maintain two
+ * global lists: of PartitionDispatch objects corresponding to partitioned
+ * tables in *pds and of the leaf partition OIDs in *leaf_part_oids.
+ */
+static void
+get_partition_dispatch_recurse(Relation rel, Relation parent,
+ List **pds, List **leaf_part_oids)
+{
+ PartitionDesc partdesc = RelationGetPartitionDesc(rel);
+ PartitionKey partkey = RelationGetPartitionKey(rel);
+ PartitionDispatch pd;
+ int i;
+ int next_leaf_idx = list_length(*leaf_part_oids),
+ next_parted_idx = list_length(*pds);
/*
* We rely on the relcache to traverse the partition tree to build both
@@ -1016,108 +1034,47 @@ RelationGetPartitionDispatchInfo(Relation rel,
* a bit tricky but works because the foreach() macro doesn't fetch the
* next list element until the bottom of the loop.
*/
- *num_parted = 1;
- parted_rels = list_make1(rel);
- /* Root partitioned table has no parent, so NULL for parent */
- parted_rel_parents = list_make1(NULL);
- APPEND_REL_PARTITION_OIDS(rel, all_parts, all_parents);
- forboth(lc1, all_parts, lc2, all_parents)
+ pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
+ *pds = lappend(*pds, pd);
+ pd->reldesc = rel;
+ pd->key = partkey;
+ pd->partdesc = partdesc;
+ if (parent != NULL)
+ pd->parentoid = RelationGetRelid(parent);
+ else
+ pd->parentoid = InvalidOid;
+ pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
+ for (i = 0; i < partdesc->nparts; i++)
{
- Oid partrelid = lfirst_oid(lc1);
- Relation parent = lfirst(lc2);
+ Oid partrelid = partdesc->oids[i];
- if (get_rel_relkind(partrelid) == RELKIND_PARTITIONED_TABLE)
+ if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
{
- /*
- * Already locked by the caller. Note that it is the
- * responsibility of the caller to close the below relcache entry,
- * once done using the information being collected here (for
- * example, in ExecEndModifyTable).
- */
- Relation partrel = heap_open(partrelid, NoLock);
-
- (*num_parted)++;
- parted_rels = lappend(parted_rels, partrel);
- parted_rel_parents = lappend(parted_rel_parents, parent);
- APPEND_REL_PARTITION_OIDS(partrel, all_parts, all_parents);
+ *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
+ pd->indexes[i] = next_leaf_idx++;
}
- }
-
- /*
- * We want to create two arrays - one for leaf partitions and another for
- * partitioned tables (including the root table and internal partitions).
- * While we only create the latter here, leaf partition array of suitable
- * objects (such as, ResultRelInfo) is created by the caller using the
- * list of OIDs we return. Indexes into these arrays get assigned in a
- * breadth-first manner, whereby partitions of any given level are placed
- * consecutively in the respective arrays.
- */
- pd = (PartitionDispatchData **) palloc(*num_parted *
- sizeof(PartitionDispatchData *));
- *leaf_part_oids = NIL;
- i = k = offset = 0;
- forboth(lc1, parted_rels, lc2, parted_rel_parents)
- {
- Relation partrel = lfirst(lc1);
- Relation parent = lfirst(lc2);
- PartitionKey partkey = RelationGetPartitionKey(partrel);
- PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
- int j,
- m;
-
- pd[i] = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
- pd[i]->reldesc = partrel;
- pd[i]->key = partkey;
- pd[i]->partdesc = partdesc;
- if (parent != NULL)
- pd[i]->parentoid = RelationGetRelid(parent);
else
- pd[i]->parentoid = InvalidOid;
-
- pd[i]->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
-
- /*
- * Indexes corresponding to the internal partitions are multiplied by
- * -1 to distinguish them from those of leaf partitions. Encountering
- * an index >= 0 means we found a leaf partition, which is immediately
- * returned as the partition we are looking for. A negative index
- * means we found a partitioned table, whose PartitionDispatch object
- * is located at the above index multiplied back by -1. Using the
- * PartitionDispatch object, search is continued further down the
- * partition tree.
- */
- m = 0;
- for (j = 0; j < partdesc->nparts; j++)
{
- Oid partrelid = partdesc->oids[j];
+ Relation partrel = heap_open(partrelid, NoLock);
- if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
- {
- *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
- pd[i]->indexes[j] = k++;
- }
- else
- {
- /*
- * offset denotes the number of partitioned tables of upper
- * levels including those of the current level. Any partition
- * of this table must belong to the next level and hence will
- * be placed after the last partitioned table of this level.
- */
- pd[i]->indexes[j] = -(1 + offset + m);
- m++;
- }
- }
- i++;
+ /*
+ * offset denotes the number of partitioned tables of upper
+ * levels including those of the current level. Any partition
+ * of this table must belong to the next level and hence will
+ * be placed after the last partitioned table of this level.
+ */
+ pd->indexes[i] = -(1 + next_parted_idx);
+ get_partition_dispatch_recurse(partrel, rel, pds, leaf_part_oids);
- /*
- * This counts the number of partitioned tables at upper levels
- * including those of the current level.
- */
- offset += m;
+ /*
+ * Fast-forward both leaf partition and parted indexes to account
+ * for the leaf partitions and PartitionDispatch objects just
+ * added.
+ */
+ next_parted_idx += (list_length(*pds) - next_parted_idx - 1);
+ next_leaf_idx += (list_length(*leaf_part_oids) - next_leaf_idx);
+ }
}
-
- return pd;
}
/* Module-local functions */
--
2.11.0
Thanks Amit for the patch. I am still reviewing it, but meanwhile
below are a few comments so far ...
On 8 September 2017 at 15:53, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
[PATCH 2/2] Make RelationGetPartitionDispatch expansion order
depth-first
This is so as it matches what the planner is doing with partitioning
inheritance expansion. Matching with planner order helps because
it helps ease matching the executor's per-partition objects with
planner-created per-partition nodes.
+ next_parted_idx += (list_length(*pds) - next_parted_idx - 1);
I think this can be replaced just by :
+ next_parted_idx = list_length(*pds) - 1;
Or, how about removing this variable next_parted_idx altogether ?
Instead, we can just do this :
pd->indexes[i] = -(1 + list_length(*pds));
If that is not possible, I may be missing something.
-----------
+ next_leaf_idx += (list_length(*leaf_part_oids) - next_leaf_idx);
Didn't understand why next_leaf_idx needs to be updated in case when
the current partrelid is partitioned. I think it should be incremented
only for leaf partitions, no ? Or for that matter, again, how about
removing the variable 'next_leaf_idx' and doing this :
*leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
pd->indexes[i] = list_length(*leaf_part_oids) - 1;
-----------
* For every partitioned table in the tree, starting with the root
* partitioned table, add its relcache entry to parted_rels, while also
* queuing its partitions (in the order in which they appear in the
* partition descriptor) to be looked at later in the same loop. This is
* a bit tricky but works because the foreach() macro doesn't fetch the
* next list element until the bottom of the loop.
I think the above comment needs to be modified with something
explaining the relevant changed code. For e.g. there is no
parted_rels, and the "tricky" part was there earlier because of the
list being iterated and at the same time being appended.
------------
I couldn't see the existing comments like "Indexes corresponding to
the internal partitions are multiplied by" anywhere in the patch. I
think those comments are still valid, and important.
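
To make the sign convention described in that dropped comment concrete, here is a minimal, self-contained sketch. It is not the PostgreSQL code: ToyDispatch, toy_pd and toy_route are hypothetical names, and the "picks" array stands in for the per-level partition bound search that the real routing code performs. An entry >= 0 is a leaf partition index and ends the search; a negative entry is negated to find the next dispatch object in the array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for a PartitionDispatch object. */
typedef struct ToyDispatch
{
    int         nparts;     /* number of partitions of this table */
    const int  *indexes;    /* >= 0: leaf index; < 0: -(slot in dispatch array) */
} ToyDispatch;

/* A three-level toy tree: root -> p1 -> p2, with five leaves in total. */
static const int toy_root_idx[] = {0, -1, 4};
static const int toy_p1_idx[] = {1, -2, 3};
static const int toy_p2_idx[] = {2};

static const ToyDispatch toy_root = {3, toy_root_idx};
static const ToyDispatch toy_p1 = {3, toy_p1_idx};
static const ToyDispatch toy_p2 = {1, toy_p2_idx};

/* Slot 0 is the root; the slots match the negated index values above. */
static const ToyDispatch *const toy_pd[] = {&toy_root, &toy_p1, &toy_p2};

/*
 * Walk down the tree.  picks[level] is the partition chosen at each level
 * (an assumption replacing the bound search).  Returns the leaf index, or
 * -1 if the picks run out before a leaf is reached.
 */
static int
toy_route(const ToyDispatch *const *pd, const int *picks, int npicks)
{
    const ToyDispatch *parent = pd[0];
    int         level;

    for (level = 0; level < npicks; level++)
    {
        int         cur = picks[level];

        assert(cur >= 0 && cur < parent->nparts);
        if (parent->indexes[cur] >= 0)
            return parent->indexes[cur];    /* found a leaf partition */

        /* Negative entry: continue the search in that partitioned table. */
        parent = pd[-parent->indexes[cur]];
    }
    return -1;
}
```

Since slot 0 always holds the root, no negative entry can ever refer to it, which is why a plain negation is enough to tell the two kinds of index apart.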
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Hi Amit,
On 2017/09/11 16:16, Amit Khandekar wrote:
Thanks Amit for the patch. I am still reviewing it, but meanwhile
below are a few comments so far ...
Thanks for the review.
+ next_parted_idx += (list_length(*pds) - next_parted_idx - 1);
I think this can be replaced just by :
+ next_parted_idx = list_length(*pds) - 1;
Or, how about removing this variable next_parted_idx altogether ?
Instead, we can just do this :
pd->indexes[i] = -(1 + list_length(*pds));
That seems like the simplest possible way to do it.
+ next_leaf_idx += (list_length(*leaf_part_oids) - next_leaf_idx);
Didn't understand why next_leaf_idx needs to be updated in case when
the current partrelid is partitioned. I think it should be incremented
only for leaf partitions, no ? Or for that matter, again, how about
removing the variable 'next_leaf_idx' and doing this :
*leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
pd->indexes[i] = list_length(*leaf_part_oids) - 1;
Yep.
Attached updated patch does it that way for both partitioned table indexes
and leaf partition indexes. Thanks for pointing it out.
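
As an illustration of the simplified scheme, the following is a minimal sketch of depth-first index assignment that derives both kinds of index from the current list lengths instead of separate counters. It is an assumption-laden toy, not the attached patch: ToyRel, toy_parted, toy_leaves and toy_expand are hypothetical names, and fixed-size arrays replace the List API.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_NODES 16
#define TOY_MAX_KIDS 4

/* Hypothetical stand-in for a table in a partition tree. */
typedef struct ToyRel
{
    int         nparts;                 /* 0 means a leaf partition */
    struct ToyRel *parts[TOY_MAX_KIDS]; /* children, in bound order */
    int         indexes[TOY_MAX_KIDS];  /* filled in by toy_expand() */
} ToyRel;

static ToyRel *toy_parted[TOY_MAX_NODES];   /* plays the role of *pds */
static int  toy_num_parted = 0;
static ToyRel *toy_leaves[TOY_MAX_NODES];   /* plays the role of *leaf_part_oids */
static int  toy_num_leaves = 0;

/* Leaves are zero-initialized, so nparts == 0 for them. */
static ToyRel toy_a, toy_b, toy_c, toy_d, toy_e;
static ToyRel toy_p2 = {1, {&toy_c}, {0}};
static ToyRel toy_p1 = {3, {&toy_b, &toy_p2, &toy_d}, {0}};
static ToyRel toy_root = {3, {&toy_a, &toy_p1, &toy_e}, {0}};

static void
toy_expand(ToyRel *rel)
{
    int         i;

    assert(toy_num_parted < TOY_MAX_NODES);
    toy_parted[toy_num_parted++] = rel;     /* appended before its children */

    for (i = 0; i < rel->nparts; i++)
    {
        ToyRel     *child = rel->parts[i];

        if (child->nparts == 0)
        {
            assert(toy_num_leaves < TOY_MAX_NODES);
            toy_leaves[toy_num_leaves++] = child;
            rel->indexes[i] = toy_num_leaves - 1;
        }
        else
        {
            /*
             * The child will occupy the next free slot, which equals the
             * current count because rel itself was appended above.  Code
             * that appends at a different point would need another offset.
             */
            rel->indexes[i] = -toy_num_parted;
            toy_expand(child);
        }
    }
}

/* Expand the toy tree from scratch and return its root. */
static ToyRel *
toy_expand_all(void)
{
    toy_num_parted = 0;
    toy_num_leaves = 0;
    toy_expand(&toy_root);
    return &toy_root;
}
```

With this tree the partitioned tables end up in the order root, p1, p2 and the leaves in the order a, b, c, d, e, so every subtree's leaves are contiguous. Whether the negative index needs an extra 1 depends only on whether the parent has already been appended when the expression is evaluated.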
-----------
* For every partitioned table in the tree, starting with the root
* partitioned table, add its relcache entry to parted_rels, while also
* queuing its partitions (in the order in which they appear in the
* partition descriptor) to be looked at later in the same loop. This is
* a bit tricky but works because the foreach() macro doesn't fetch the
* next list element until the bottom of the loop.
I think the above comment needs to be modified with something
explaining the relevant changed code. For e.g. there is no
parted_rels, and the "tricky" part was there earlier because of the
list being iterated and at the same time being appended.
------------
I think I forgot to update this comment.
I couldn't see the existing comments like "Indexes corresponding to
the internal partitions are multiplied by" anywhere in the patch. I
think those comments are still valid, and important.
Again, I failed to keep this comment. Anyway, I reworded the comments a
bit to describe what the code is doing more clearly. Hope you find it so too.
Thanks,
Amit
Attachments:
0001-Decouple-RelationGetPartitionDispatchInfo-from-execu.patch
From 0e04cee14a5168e0652c2aa646c169789ae41e8e Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Wed, 30 Aug 2017 10:02:05 +0900
Subject: [PATCH 1/2] Decouple RelationGetPartitionDispatchInfo() from executor
Currently it and the structure it generates viz. PartitionDispatch
objects are too coupled with the executor's tuple-routing code. In
particular, it's pretty undesirable that it makes it the responsibility
of the caller to release some resources, such as executor tuple table
slots, tuple-conversion maps, etc. That makes it harder to use in
places other than where it's currently being used.
After this refactoring, ExecSetupPartitionTupleRouting() now needs to
do some of the work that was previously done in
RelationGetPartitionDispatchInfo().
---
src/backend/catalog/partition.c | 53 +++++++-------------
src/backend/commands/copy.c | 32 +++++++------
src/backend/executor/execMain.c | 88 ++++++++++++++++++++++++++++++----
src/backend/executor/nodeModifyTable.c | 37 +++++++-------
src/include/catalog/partition.h | 20 +++-----
src/include/executor/executor.h | 4 +-
src/include/nodes/execnodes.h | 40 +++++++++++++++-
7 files changed, 181 insertions(+), 93 deletions(-)
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 73eff17202..555b7c21c7 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1292,7 +1292,6 @@ RelationGetPartitionDispatchInfo(Relation rel,
Relation partrel = lfirst(lc1);
Relation parent = lfirst(lc2);
PartitionKey partkey = RelationGetPartitionKey(partrel);
- TupleDesc tupdesc = RelationGetDescr(partrel);
PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
int j,
m;
@@ -1300,29 +1299,12 @@ RelationGetPartitionDispatchInfo(Relation rel,
pd[i] = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
pd[i]->reldesc = partrel;
pd[i]->key = partkey;
- pd[i]->keystate = NIL;
pd[i]->partdesc = partdesc;
if (parent != NULL)
- {
- /*
- * For every partitioned table other than root, we must store a
- * tuple table slot initialized with its tuple descriptor and a
- * tuple conversion map to convert a tuple from its parent's
- * rowtype to its own. That is to make sure that we are looking at
- * the correct row using the correct tuple descriptor when
- * computing its partition key for tuple routing.
- */
- pd[i]->tupslot = MakeSingleTupleTableSlot(tupdesc);
- pd[i]->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
- tupdesc,
- gettext_noop("could not convert row type"));
- }
+ pd[i]->parentoid = RelationGetRelid(parent);
else
- {
- /* Not required for the root partitioned table */
- pd[i]->tupslot = NULL;
- pd[i]->tupmap = NULL;
- }
+ pd[i]->parentoid = InvalidOid;
+
pd[i]->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
/*
@@ -2233,7 +2215,7 @@ generate_partition_qual(Relation rel)
* Construct values[] and isnull[] arrays for the partition key
* of a tuple.
*
- * pd Partition dispatch object of the partitioned table
+ * ptrinfo PartitionTupleRoutingInfo object of the table
* slot Heap tuple from which to extract partition key
* estate executor state for evaluating any partition key
* expressions (must be non-NULL)
@@ -2245,26 +2227,27 @@ generate_partition_qual(Relation rel)
* ----------------
*/
void
-FormPartitionKeyDatum(PartitionDispatch pd,
+FormPartitionKeyDatum(PartitionTupleRoutingInfo *ptrinfo,
TupleTableSlot *slot,
EState *estate,
Datum *values,
bool *isnull)
{
+ PartitionDispatch pd = ptrinfo->pd;
ListCell *partexpr_item;
int i;
- if (pd->key->partexprs != NIL && pd->keystate == NIL)
+ if (pd->key->partexprs != NIL && ptrinfo->keystate == NIL)
{
/* Check caller has set up context correctly */
Assert(estate != NULL &&
GetPerTupleExprContext(estate)->ecxt_scantuple == slot);
/* First time through, set up expression evaluation state */
- pd->keystate = ExecPrepareExprList(pd->key->partexprs, estate);
+ ptrinfo->keystate = ExecPrepareExprList(pd->key->partexprs, estate);
}
- partexpr_item = list_head(pd->keystate);
+ partexpr_item = list_head(ptrinfo->keystate);
for (i = 0; i < pd->key->partnatts; i++)
{
AttrNumber keycol = pd->key->partattrs[i];
@@ -2304,13 +2287,13 @@ FormPartitionKeyDatum(PartitionDispatch pd,
* the latter case.
*/
int
-get_partition_for_tuple(PartitionDispatch *pd,
+get_partition_for_tuple(PartitionTupleRoutingInfo **ptrinfos,
TupleTableSlot *slot,
EState *estate,
- PartitionDispatchData **failed_at,
+ PartitionTupleRoutingInfo **failed_at,
TupleTableSlot **failed_slot)
{
- PartitionDispatch parent;
+ PartitionTupleRoutingInfo *parent;
Datum values[PARTITION_MAX_KEYS];
bool isnull[PARTITION_MAX_KEYS];
int result;
@@ -2318,11 +2301,11 @@ get_partition_for_tuple(PartitionDispatch *pd,
TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
/* start with the root partitioned table */
- parent = pd[0];
+ parent = ptrinfos[0];
while (true)
{
- PartitionKey key = parent->key;
- PartitionDesc partdesc = parent->partdesc;
+ PartitionKey key = parent->pd->key;
+ PartitionDesc partdesc = parent->pd->partdesc;
TupleTableSlot *myslot = parent->tupslot;
TupleConversionMap *map = parent->tupmap;
int cur_index = -1;
@@ -2458,13 +2441,13 @@ get_partition_for_tuple(PartitionDispatch *pd,
*failed_slot = slot;
break;
}
- else if (parent->indexes[cur_index] >= 0)
+ else if (parent->pd->indexes[cur_index] >= 0)
{
- result = parent->indexes[cur_index];
+ result = parent->pd->indexes[cur_index];
break;
}
else
- parent = pd[-parent->indexes[cur_index]];
+ parent = ptrinfos[-parent->pd->indexes[cur_index]];
}
error_exit:
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfa3f059c2..288d6a1ab2 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -165,8 +165,8 @@ typedef struct CopyStateData
bool volatile_defexprs; /* is any of defexprs volatile? */
List *range_table;
- PartitionDispatch *partition_dispatch_info;
- int num_dispatch; /* Number of entries in the above array */
+ PartitionTupleRoutingInfo **ptrinfos;
+ int num_parted; /* Number of entries in the above array */
int num_partitions; /* Number of members in the following arrays */
ResultRelInfo *partitions; /* Per partition result relation */
TupleConversionMap **partition_tupconv_maps;
@@ -2445,7 +2445,7 @@ CopyFrom(CopyState cstate)
*/
if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- PartitionDispatch *partition_dispatch_info;
+ PartitionTupleRoutingInfo **ptrinfos;
ResultRelInfo *partitions;
TupleConversionMap **partition_tupconv_maps;
TupleTableSlot *partition_tuple_slot;
@@ -2455,13 +2455,13 @@ CopyFrom(CopyState cstate)
ExecSetupPartitionTupleRouting(cstate->rel,
1,
estate,
- &partition_dispatch_info,
+ &ptrinfos,
&partitions,
&partition_tupconv_maps,
&partition_tuple_slot,
&num_parted, &num_partitions);
- cstate->partition_dispatch_info = partition_dispatch_info;
- cstate->num_dispatch = num_parted;
+ cstate->ptrinfos = ptrinfos;
+ cstate->num_parted = num_parted;
cstate->partitions = partitions;
cstate->num_partitions = num_partitions;
cstate->partition_tupconv_maps = partition_tupconv_maps;
@@ -2502,7 +2502,7 @@ CopyFrom(CopyState cstate)
if ((resultRelInfo->ri_TrigDesc != NULL &&
(resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
- cstate->partition_dispatch_info != NULL ||
+ cstate->ptrinfos != NULL ||
cstate->volatile_defexprs)
{
useHeapMultiInsert = false;
@@ -2580,7 +2580,7 @@ CopyFrom(CopyState cstate)
ExecStoreTuple(tuple, slot, InvalidBuffer, false);
/* Determine the partition to heap_insert the tuple into */
- if (cstate->partition_dispatch_info)
+ if (cstate->ptrinfos)
{
int leaf_part_index;
TupleConversionMap *map;
@@ -2594,7 +2594,7 @@ CopyFrom(CopyState cstate)
* partition, respectively.
*/
leaf_part_index = ExecFindPartition(resultRelInfo,
- cstate->partition_dispatch_info,
+ cstate->ptrinfos,
slot,
estate);
Assert(leaf_part_index >= 0 &&
@@ -2826,8 +2826,8 @@ CopyFrom(CopyState cstate)
ExecCloseIndices(resultRelInfo);
- /* Close all the partitioned tables, leaf partitions, and their indices */
- if (cstate->partition_dispatch_info)
+ /* Release some resources that we acquired for tuple-routing. */
+ if (cstate->ptrinfos)
{
int i;
@@ -2837,13 +2837,15 @@ CopyFrom(CopyState cstate)
* the main target table of COPY that will be closed eventually by
* DoCopy(). Also, tupslot is NULL for the root partitioned table.
*/
- for (i = 1; i < cstate->num_dispatch; i++)
+ for (i = 1; i < cstate->num_parted; i++)
{
- PartitionDispatch pd = cstate->partition_dispatch_info[i];
+ PartitionTupleRoutingInfo *ptrinfo = cstate->ptrinfos[i];
- heap_close(pd->reldesc, NoLock);
- ExecDropSingleTupleTableSlot(pd->tupslot);
+ heap_close(ptrinfo->pd->reldesc, NoLock);
+ ExecDropSingleTupleTableSlot(ptrinfo->tupslot);
}
+
+ /* Close all the leaf partitions and their indices */
for (i = 0; i < cstate->num_partitions; i++)
{
ResultRelInfo *resultRelInfo = cstate->partitions + i;
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 4b594d489c..fe186abe69 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -3243,8 +3243,8 @@ EvalPlanQualEnd(EPQState *epqstate)
* tuple routing for partitioned tables
*
* Output arguments:
- * 'pd' receives an array of PartitionDispatch objects with one entry for
- * every partitioned table in the partition tree
+ * 'ptrinfos' receives an array of PartitionTupleRoutingInfo objects with one
+ * entry for each partitioned table in the partition tree
* 'partitions' receives an array of ResultRelInfo objects with one entry for
* every leaf partition in the partition tree
* 'tup_conv_maps' receives an array of TupleConversionMap objects with one
@@ -3267,7 +3267,7 @@ void
ExecSetupPartitionTupleRouting(Relation rel,
Index resultRTindex,
EState *estate,
- PartitionDispatch **pd,
+ PartitionTupleRoutingInfo ***ptrinfos,
ResultRelInfo **partitions,
TupleConversionMap ***tup_conv_maps,
TupleTableSlot **partition_tuple_slot,
@@ -3275,16 +3275,84 @@ ExecSetupPartitionTupleRouting(Relation rel,
{
TupleDesc tupDesc = RelationGetDescr(rel);
List *leaf_parts;
+ PartitionDispatch *pds;
ListCell *cell;
int i;
ResultRelInfo *leaf_part_rri;
+ Relation parent;
/*
* Get the information about the partition tree after locking all the
* partitions.
*/
(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
- *pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
+ pds = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
+
+ /*
+ * Construct PartitionTupleRoutingInfo objects, one for each partitioned
+ * table in the tree, using its PartitionDispatch in the pds array.
+ */
+ *ptrinfos = (PartitionTupleRoutingInfo **) palloc0(*num_parted *
+ sizeof(PartitionTupleRoutingInfo *));
+ parent = NULL;
+ for (i = 0; i < *num_parted; i++)
+ {
+ PartitionTupleRoutingInfo *ptrinfo;
+
+ ptrinfo = (PartitionTupleRoutingInfo *)
+ palloc0(sizeof(PartitionTupleRoutingInfo));
+ /* Stash a reference to this PartitionDispatch. */
+ ptrinfo->pd = pds[i];
+
+ /* State for extracting partition key from tuples will go here. */
+ ptrinfo->keystate = NIL;
+
+ /*
+ * For every partitioned table other than root, we must store a tuple
+ * table slot initialized with its tuple descriptor and a tuple
+ * conversion map to convert a tuple from its parent's rowtype to its
+ * own. That is to make sure that we are looking at the correct row
+ * using the correct tuple descriptor when computing its partition key
+ * for tuple routing.
+ */
+ if (pds[i]->parentoid != InvalidOid)
+ {
+ TupleDesc tupdesc = RelationGetDescr(pds[i]->reldesc);
+
+ /* Open the parent relation descriptor if not already done. */
+ if (pds[i]->parentoid == RelationGetRelid(rel))
+ parent = rel;
+ else if (parent == NULL)
+ /* Locked by RelationGetPartitionDispatchInfo(). */
+ parent = heap_open(pds[i]->parentoid, NoLock);
+
+ ptrinfo->tupslot = MakeSingleTupleTableSlot(tupdesc);
+ ptrinfo->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
+ tupdesc,
+ gettext_noop("could not convert row type"));
+ /*
+ * Close the parent descriptor, if the next partitioned table in
+ * the list is not a sibling, because it will have a different
+ * parent if so.
+ */
+ if (parent != NULL && parent != rel && i + 1 < *num_parted &&
+ pds[i + 1]->parentoid != pds[i]->parentoid)
+ {
+ heap_close(parent, NoLock);
+ parent = NULL;
+ }
+ }
+ else
+ {
+ /* Not required for the root partitioned table */
+ ptrinfo->tupslot = NULL;
+ ptrinfo->tupmap = NULL;
+ }
+
+ (*ptrinfos)[i] = ptrinfo;
+ }
+
+ /* For leaf partitions, we build ResultRelInfos and TupleConversionMaps. */
*num_partitions = list_length(leaf_parts);
*partitions = (ResultRelInfo *) palloc(*num_partitions *
sizeof(ResultRelInfo));
@@ -3361,11 +3429,13 @@ ExecSetupPartitionTupleRouting(Relation rel,
* by get_partition_for_tuple() unchanged.
*/
int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
- TupleTableSlot *slot, EState *estate)
+ExecFindPartition(ResultRelInfo *resultRelInfo,
+ PartitionTupleRoutingInfo **ptrinfos,
+ TupleTableSlot *slot,
+ EState *estate)
{
int result;
- PartitionDispatchData *failed_at;
+ PartitionTupleRoutingInfo *failed_at;
TupleTableSlot *failed_slot;
/*
@@ -3375,7 +3445,7 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
if (resultRelInfo->ri_PartitionCheck)
ExecPartitionCheck(resultRelInfo, slot, estate);
- result = get_partition_for_tuple(pd, slot, estate,
+ result = get_partition_for_tuple(ptrinfos, slot, estate,
&failed_at, &failed_slot);
if (result < 0)
{
@@ -3385,7 +3455,7 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
char *val_desc;
ExprContext *ecxt = GetPerTupleExprContext(estate);
- failed_rel = failed_at->reldesc;
+ failed_rel = failed_at->pd->reldesc;
ecxt->ecxt_scantuple = failed_slot;
FormPartitionKeyDatum(failed_at, failed_slot, estate,
key_values, key_isnull);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 49586a3c03..61ea9afa01 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -278,7 +278,7 @@ ExecInsert(ModifyTableState *mtstate,
resultRelInfo = estate->es_result_relation_info;
/* Determine the partition to heap_insert the tuple into */
- if (mtstate->mt_partition_dispatch_info)
+ if (mtstate->mt_ptrinfos)
{
int leaf_part_index;
TupleConversionMap *map;
@@ -292,7 +292,7 @@ ExecInsert(ModifyTableState *mtstate,
* respectively.
*/
leaf_part_index = ExecFindPartition(resultRelInfo,
- mtstate->mt_partition_dispatch_info,
+ mtstate->mt_ptrinfos,
slot,
estate);
Assert(leaf_part_index >= 0 &&
@@ -1487,7 +1487,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
int numResultRelInfos;
/* Find the set of partitions so that we can find their TupleDescs. */
- if (mtstate->mt_partition_dispatch_info != NULL)
+ if (mtstate->mt_ptrinfos != NULL)
{
/*
* For INSERT via partitioned table, so we need TupleDescs based
@@ -1911,7 +1911,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
if (operation == CMD_INSERT &&
rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- PartitionDispatch *partition_dispatch_info;
+ PartitionTupleRoutingInfo **ptrinfos;
ResultRelInfo *partitions;
TupleConversionMap **partition_tupconv_maps;
TupleTableSlot *partition_tuple_slot;
@@ -1921,13 +1921,13 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
ExecSetupPartitionTupleRouting(rel,
node->nominalRelation,
estate,
- &partition_dispatch_info,
+ &ptrinfos,
&partitions,
&partition_tupconv_maps,
&partition_tuple_slot,
&num_parted, &num_partitions);
- mtstate->mt_partition_dispatch_info = partition_dispatch_info;
- mtstate->mt_num_dispatch = num_parted;
+ mtstate->mt_ptrinfos = ptrinfos;
+ mtstate->mt_num_parted = num_parted;
mtstate->mt_partitions = partitions;
mtstate->mt_num_partitions = num_partitions;
mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
@@ -2342,21 +2342,24 @@ ExecEndModifyTable(ModifyTableState *node)
resultRelInfo);
}
+ /* Release some resources that we acquired for tuple-routing. */
+
/*
- * Close all the partitioned tables, leaf partitions, and their indices
- *
- * Remember node->mt_partition_dispatch_info[0] corresponds to the root
- * partitioned table, which we must not try to close, because it is the
- * main target table of the query that will be closed by ExecEndPlan().
- * Also, tupslot is NULL for the root partitioned table.
+ * node->mt_ptrinfos[0] corresponds to the root partitioned table, for
+ * which we didn't create tupslot. Also, its relation descriptor will
+ * be closed in ExecEndPlan().
*/
- for (i = 1; i < node->mt_num_dispatch; i++)
+ for (i = 1; i < node->mt_num_parted; i++)
{
- PartitionDispatch pd = node->mt_partition_dispatch_info[i];
+ PartitionTupleRoutingInfo *ptrinfo = node->mt_ptrinfos[i];
- heap_close(pd->reldesc, NoLock);
- ExecDropSingleTupleTableSlot(pd->tupslot);
+ heap_close(ptrinfo->pd->reldesc, NoLock);
+ ExecDropSingleTupleTableSlot(ptrinfo->tupslot);
}
+
+ /*
+ * Close all the leaf partitions and their indices.
+ */
for (i = 0; i < node->mt_num_partitions; i++)
{
ResultRelInfo *resultRelInfo = node->mt_partitions + i;
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 454a940a23..ebf82b55cf 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -45,13 +45,8 @@ typedef struct PartitionDescData *PartitionDesc;
*
* reldesc Relation descriptor of the table
* key Partition key information of the table
- * keystate Execution state required for expressions in the partition key
* partdesc Partition descriptor of the table
- * tupslot A standalone TupleTableSlot initialized with this table's tuple
- * descriptor
- * tupmap TupleConversionMap to convert from the parent's rowtype to
- * this table's rowtype (when extracting the partition key of a
- * tuple just before routing it through this table)
+ * parentoid OID of the parent table (InvalidOid if root partitioned table)
* indexes Array with partdesc->nparts members (for details on what
* individual members represent, see how they are set in
* RelationGetPartitionDispatchInfo())
@@ -61,10 +56,8 @@ typedef struct PartitionDispatchData
{
Relation reldesc;
PartitionKey key;
- List *keystate; /* list of ExprState */
PartitionDesc partdesc;
- TupleTableSlot *tupslot;
- TupleConversionMap *tupmap;
+ Oid parentoid;
int *indexes;
} PartitionDispatchData;
@@ -86,18 +79,19 @@ extern List *map_partition_varattnos(List *expr, int target_varno,
extern List *RelationGetPartitionQual(Relation rel);
extern Expr *get_partition_qual_relid(Oid relid);
-/* For tuple routing */
extern PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
int *num_parted, List **leaf_part_oids);
-extern void FormPartitionKeyDatum(PartitionDispatch pd,
+
+/* For tuple routing */
+extern void FormPartitionKeyDatum(PartitionTupleRoutingInfo *ptrinfo,
TupleTableSlot *slot,
EState *estate,
Datum *values,
bool *isnull);
-extern int get_partition_for_tuple(PartitionDispatch *pd,
+extern int get_partition_for_tuple(PartitionTupleRoutingInfo **ptrinfos,
TupleTableSlot *slot,
EState *estate,
- PartitionDispatchData **failed_at,
+ PartitionTupleRoutingInfo **failed_at,
TupleTableSlot **failed_slot);
extern Oid get_default_oid_from_partdesc(PartitionDesc partdesc);
extern Oid get_default_partition_oid(Oid parentId);
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 770881849c..aee7a41b31 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -209,13 +209,13 @@ extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti);
extern void ExecSetupPartitionTupleRouting(Relation rel,
Index resultRTindex,
EState *estate,
- PartitionDispatch **pd,
+ PartitionTupleRoutingInfo ***ptrinfos,
ResultRelInfo **partitions,
TupleConversionMap ***tup_conv_maps,
TupleTableSlot **partition_tuple_slot,
int *num_parted, int *num_partitions);
extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
- PartitionDispatch *pd,
+ PartitionTupleRoutingInfo **ptrinfos,
TupleTableSlot *slot,
EState *estate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 90a60abc4d..c554a1b311 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -414,6 +414,42 @@ typedef struct ResultRelInfo
Relation ri_PartitionRoot;
} ResultRelInfo;
+/* Forward declarations, to avoid including other headers */
+typedef struct PartitionDispatchData *PartitionDispatch;
+
+/*
+ * PartitionTupleRoutingInfo - information required for tuple-routing
+ * through one partitioned table in a partition
+ * tree
+ */
+typedef struct PartitionTupleRoutingInfo
+{
+
+ /* Information about the table's partitions */
+ PartitionDispatch pd;
+
+ /*
+ * The execution state required for expressions contained in the partition
+ * key. It is NIL until initialized by FormPartitionKeyDatum() if and when
+ * it is called; for example, the first time a tuple is routed through this
+ * table.
+ */
+ List *keystate;
+
+ /*
+ * A standalone TupleTableSlot initialized with this table's tuple
+ * descriptor
+ */
+ TupleTableSlot *tupslot;
+
+ /*
+ * TupleConversionMap to convert from the parent's rowtype to this table's
+ * rowtype (when extracting the partition key of a tuple just before
+ * routing it through this table)
+ */
+ TupleConversionMap *tupmap;
+} PartitionTupleRoutingInfo;
+
/* ----------------
* EState information
*
@@ -973,9 +1009,9 @@ typedef struct ModifyTableState
TupleTableSlot *mt_existing; /* slot to store existing target tuple in */
List *mt_excludedtlist; /* the excluded pseudo relation's tlist */
TupleTableSlot *mt_conflproj; /* CONFLICT ... SET ... projection target */
- struct PartitionDispatchData **mt_partition_dispatch_info;
/* Tuple-routing support info */
- int mt_num_dispatch; /* Number of entries in the above array */
+ struct PartitionTupleRoutingInfo **mt_ptrinfos;
+ int mt_num_parted; /* Number of entries in the above array */
int mt_num_partitions; /* Number of members in the following
* arrays */
ResultRelInfo *mt_partitions; /* Per partition result relation */
--
2.11.0
0002-Make-RelationGetPartitionDispatch-expansion-order-de.patch (text/plain; charset=UTF-8)
From e980019fdc688321af809d7d3547d25a3d6ff15f Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Fri, 8 Sep 2017 17:35:10 +0900
Subject: [PATCH 2/2] Make RelationGetPartitionDispatch expansion order
depth-first
This makes the expansion order match the planner's expansion of the
partition inheritance hierarchy. Matching the planner's order makes it
easier to pair the executor's per-partition objects with the
planner-created per-partition nodes.
---
src/backend/catalog/partition.c | 211 ++++++++++++++++------------------------
1 file changed, 83 insertions(+), 128 deletions(-)
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 555b7c21c7..84c63a9ffe 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -147,6 +147,8 @@ static int32 partition_bound_cmp(PartitionKey key,
static int partition_bound_bsearch(PartitionKey key,
PartitionBoundInfo boundinfo,
void *probe, bool probe_is_bound, bool *is_equal);
+static void get_partition_dispatch_recurse(Relation rel, Relation parent,
+ List **pds, List **leaf_part_oids);
/*
* RelationBuildPartitionDesc
@@ -1192,21 +1194,6 @@ get_partition_qual_relid(Oid relid)
}
/*
- * Append OIDs of rel's partitions to the list 'partoids' and for each OID,
- * append pointer rel to the list 'parents'.
- */
-#define APPEND_REL_PARTITION_OIDS(rel, partoids, parents) \
- do\
- {\
- int i;\
- for (i = 0; i < (rel)->rd_partdesc->nparts; i++)\
- {\
- (partoids) = lappend_oid((partoids), (rel)->rd_partdesc->oids[i]);\
- (parents) = lappend((parents), (rel));\
- }\
- } while(0)
-
-/*
* RelationGetPartitionDispatchInfo
* Returns information necessary to route tuples down a partition tree
*
@@ -1222,133 +1209,101 @@ PartitionDispatch *
RelationGetPartitionDispatchInfo(Relation rel,
int *num_parted, List **leaf_part_oids)
{
+ List *pdlist = NIL;
PartitionDispatchData **pd;
- List *all_parts = NIL,
- *all_parents = NIL,
- *parted_rels,
- *parted_rel_parents;
- ListCell *lc1,
- *lc2;
- int i,
- k,
- offset;
+ ListCell *lc;
+ int i;
- /*
- * We rely on the relcache to traverse the partition tree to build both
- * the leaf partition OIDs list and the array of PartitionDispatch objects
- * for the partitioned tables in the tree. That means every partitioned
- * table in the tree must be locked, which is fine since we require the
- * caller to lock all the partitions anyway.
- *
- * For every partitioned table in the tree, starting with the root
- * partitioned table, add its relcache entry to parted_rels, while also
- * queuing its partitions (in the order in which they appear in the
- * partition descriptor) to be looked at later in the same loop. This is
- * a bit tricky but works because the foreach() macro doesn't fetch the
- * next list element until the bottom of the loop.
- */
- *num_parted = 1;
- parted_rels = list_make1(rel);
- /* Root partitioned table has no parent, so NULL for parent */
- parted_rel_parents = list_make1(NULL);
- APPEND_REL_PARTITION_OIDS(rel, all_parts, all_parents);
- forboth(lc1, all_parts, lc2, all_parents)
- {
- Oid partrelid = lfirst_oid(lc1);
- Relation parent = lfirst(lc2);
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
- if (get_rel_relkind(partrelid) == RELKIND_PARTITIONED_TABLE)
- {
- /*
- * Already locked by the caller. Note that it is the
- * responsibility of the caller to close the below relcache entry,
- * once done using the information being collected here (for
- * example, in ExecEndModifyTable).
- */
- Relation partrel = heap_open(partrelid, NoLock);
+ *num_parted = 0;
+ *leaf_part_oids = NIL;
- (*num_parted)++;
- parted_rels = lappend(parted_rels, partrel);
- parted_rel_parents = lappend(parted_rel_parents, parent);
- APPEND_REL_PARTITION_OIDS(partrel, all_parts, all_parents);
- }
+ get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
+ *num_parted = list_length(pdlist);
+ pd = (PartitionDispatchData **) palloc(*num_parted *
+ sizeof(PartitionDispatchData *));
+ i = 0;
+ foreach (lc, pdlist)
+ {
+ pd[i++] = lfirst(lc);
}
+ return pd;
+}
+
+/*
+ * get_partition_dispatch_recurse
+ * Recursively expand partition tree rooted at rel
+ *
+ * As the partition tree is expanded in a depth-first manner, we maintain two
+ * global lists: of PartitionDispatch objects corresponding to partitioned
+ * tables in *pds and of the leaf partition OIDs in *leaf_part_oids.
+ */
+static void
+get_partition_dispatch_recurse(Relation rel, Relation parent,
+ List **pds, List **leaf_part_oids)
+{
+ PartitionDesc partdesc = RelationGetPartitionDesc(rel);
+ PartitionKey partkey = RelationGetPartitionKey(rel);
+ PartitionDispatch pd;
+ int i;
+
+ /* Build a PartitionDispatch for this table and add it to *pds. */
+ pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
+ *pds = lappend(*pds, pd);
+ pd->reldesc = rel;
+ pd->key = partkey;
+ pd->partdesc = partdesc;
+ if (parent != NULL)
+ pd->parentoid = RelationGetRelid(parent);
+ else
+ pd->parentoid = InvalidOid;
+
/*
- * We want to create two arrays - one for leaf partitions and another for
- * partitioned tables (including the root table and internal partitions).
- * While we only create the latter here, leaf partition array of suitable
- * objects (such as, ResultRelInfo) is created by the caller using the
- * list of OIDs we return. Indexes into these arrays get assigned in a
- * breadth-first manner, whereby partitions of any given level are placed
- * consecutively in the respective arrays.
+ * Go look at each partition of this table. If it's a leaf partition,
+ * simply add its OID to *leaf_part_oids. If it's a partitioned table,
+ * recursively call get_partition_dispatch_recurse(), so that its
+ * partitions are processed as well and a corresponding PartitionDispatch
+ * object gets added to *pds.
+ *
+ * About the values in pd->indexes: for a leaf partition, it contains the
+ * leaf partition's position in the global list *leaf_part_oids minus 1,
+ * whereas for a partitioned table partition, it contains the partition's
+ * position in the global list *pds multiplied by -1. The latter is
+ * multiplied by -1 to distinguish partitioned tables from leaf partitions
+ * when going through the values in pd->indexes. So, for example, when
+ * using it during tuple-routing, encountering a value >= 0 means we found
+ * a leaf partition. It is immediately returned as the index in the array
+ * of ResultRelInfos of all the leaf partitions, using which we insert the
+ * tuple into that leaf partition. A negative value means we found a
+ * partitioned table. The value multiplied back by -1 is returned as the
+ * index in the array of PartitionDispatch objects of all partitioned
+ * tables in the tree, using which, search is continued further down the
+ * partition tree.
*/
- pd = (PartitionDispatchData **) palloc(*num_parted *
- sizeof(PartitionDispatchData *));
- *leaf_part_oids = NIL;
- i = k = offset = 0;
- forboth(lc1, parted_rels, lc2, parted_rel_parents)
+ pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
+ for (i = 0; i < partdesc->nparts; i++)
{
- Relation partrel = lfirst(lc1);
- Relation parent = lfirst(lc2);
- PartitionKey partkey = RelationGetPartitionKey(partrel);
- PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
- int j,
- m;
-
- pd[i] = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
- pd[i]->reldesc = partrel;
- pd[i]->key = partkey;
- pd[i]->partdesc = partdesc;
- if (parent != NULL)
- pd[i]->parentoid = RelationGetRelid(parent);
- else
- pd[i]->parentoid = InvalidOid;
-
- pd[i]->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
+ Oid partrelid = partdesc->oids[i];
- /*
- * Indexes corresponding to the internal partitions are multiplied by
- * -1 to distinguish them from those of leaf partitions. Encountering
- * an index >= 0 means we found a leaf partition, which is immediately
- * returned as the partition we are looking for. A negative index
- * means we found a partitioned table, whose PartitionDispatch object
- * is located at the above index multiplied back by -1. Using the
- * PartitionDispatch object, search is continued further down the
- * partition tree.
- */
- m = 0;
- for (j = 0; j < partdesc->nparts; j++)
+ if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
{
- Oid partrelid = partdesc->oids[j];
-
- if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
- {
- *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
- pd[i]->indexes[j] = k++;
- }
- else
- {
- /*
- * offset denotes the number of partitioned tables of upper
- * levels including those of the current level. Any partition
- * of this table must belong to the next level and hence will
- * be placed after the last partitioned table of this level.
- */
- pd[i]->indexes[j] = -(1 + offset + m);
- m++;
- }
+ *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
+ pd->indexes[i] = list_length(*leaf_part_oids) - 1;
}
- i++;
+ else
+ {
+ /*
+ * We assume all tables in the partition tree were already
+ * locked by the caller.
+ */
+ Relation partrel = heap_open(partrelid, NoLock);
- /*
- * This counts the number of partitioned tables at upper levels
- * including those of the current level.
- */
- offset += m;
+ pd->indexes[i] = -list_length(*pds);
+ get_partition_dispatch_recurse(partrel, rel, pds, leaf_part_oids);
+ }
}
-
- return pd;
}
/* Module-local functions */
--
2.11.0
On 2017/09/11 18:56, Amit Langote wrote:
Attached updated patch does it that way for both partitioned table indexes
and leaf partition indexes. Thanks for pointing it out.
It seems to me we don't really need the first patch all that much. That
is, let's keep PartitionDispatchData the way it is for now, since we don't
really have any need for it beside tuple-routing (EIBO as committed didn't
need it for one). So, let's forget about "decoupling
RelationGetPartitionDispatchInfo() from the executor" thing for now and
move on.
So, attached is just the patch to make RelationGetPartitionDispatchInfo()
traverse the partition tree in depth-first manner to be applied on HEAD.
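To make the intended ordering concrete, here is a minimal, self-contained sketch of the depth-first expansion and of how the resulting indexes are consumed while routing. It is not the patch itself: ToyRel, ToyDispatch, ToyTree, toy_tree_init(), toy_expand() and toy_route() are hypothetical stand-ins that use fixed-size arrays instead of Lists and relcache entries. Only the sign convention for the index values is taken from the patch comment.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PARTS 8
#define MAX_NODES 16

/* Hypothetical stand-in for a relation in the partition tree. */
typedef struct ToyRel
{
    int         oid;
    int         is_parted;      /* 1 if this is a partitioned table */
    int         nparts;
    const struct ToyRel *parts[MAX_PARTS];  /* in partition bound order */
} ToyRel;

/* Hypothetical stand-in for PartitionDispatchData. */
typedef struct ToyDispatch
{
    const ToyRel *rel;
    int         indexes[MAX_PARTS];
} ToyDispatch;

typedef struct ToyTree
{
    ToyDispatch pds[MAX_NODES];     /* partitioned tables, depth-first */
    int         num_parted;
    int         leaf_oids[MAX_NODES];   /* leaf partitions, depth-first */
    int         num_leaves;
} ToyTree;

static void
toy_tree_init(ToyTree *tree)
{
    tree->num_parted = 0;
    tree->num_leaves = 0;
}

/* Expand the tree rooted at rel depth-first, filling both arrays. */
static void
toy_expand(const ToyRel *rel, ToyTree *tree)
{
    ToyDispatch *pd;
    int         i;

    assert(tree->num_parted < MAX_NODES);
    pd = &tree->pds[tree->num_parted++];
    pd->rel = rel;

    for (i = 0; i < rel->nparts; i++)
    {
        const ToyRel *part = rel->parts[i];

        if (!part->is_parted)
        {
            /* Leaf: value >= 0 is its position among all leaves. */
            assert(tree->num_leaves < MAX_NODES);
            tree->leaf_oids[tree->num_leaves] = part->oid;
            pd->indexes[i] = tree->num_leaves++;
        }
        else
        {
            /*
             * Partitioned table: it will occupy the next free dispatch
             * slot, so store that slot number negated, then recurse.
             */
            pd->indexes[i] = -tree->num_parted;
            toy_expand(part, tree);
        }
    }
}

/*
 * Follow a sequence of partition positions from the root. Returns the leaf
 * index reached, or -1 if the path is invalid or stops at a partitioned
 * table.
 */
static int
toy_route(const ToyTree *tree, const int *choices, int nchoices)
{
    const ToyDispatch *pd = &tree->pds[0];
    int         i;

    for (i = 0; i < nchoices; i++)
    {
        int         idx;

        if (choices[i] < 0 || choices[i] >= pd->rel->nparts)
            return -1;
        idx = pd->indexes[choices[i]];
        if (idx >= 0)
            return idx;         /* found a leaf partition */
        pd = &tree->pds[-idx];  /* descend into the sub-partitioned table */
    }
    return -1;
}
```

For a root with partitions (leaf 11, partitioned 12, leaf 13), where 12 has partitions (leaf 121, partitioned 122) and 122 has the single leaf 1221, this yields the leaf order 11, 121, 1221, 13. A breadth-first expansion would instead place 13 right after 11.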
Thoughts?
Thanks,
Amit
Attachments:
0001-Make-RelationGetPartitionDispatch-expansion-order-de.patch (text/plain; charset=UTF-8)
From 1e99c776eda30c29fdb0e48570d6b3acd6b9a05d Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Fri, 8 Sep 2017 17:35:10 +0900
Subject: [PATCH] Make RelationGetPartitionDispatch expansion order depth-first
This makes the expansion order match the planner's expansion of the
partition inheritance hierarchy. Matching the planner's order makes it
easier to pair the executor's per-partition objects with the
planner-created per-partition nodes.
---
src/backend/catalog/partition.c | 242 ++++++++++++++++------------------------
1 file changed, 99 insertions(+), 143 deletions(-)
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 73eff17202..ddb46a80cb 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -147,6 +147,8 @@ static int32 partition_bound_cmp(PartitionKey key,
static int partition_bound_bsearch(PartitionKey key,
PartitionBoundInfo boundinfo,
void *probe, bool probe_is_bound, bool *is_equal);
+static void get_partition_dispatch_recurse(Relation rel, Relation parent,
+ List **pds, List **leaf_part_oids);
/*
* RelationBuildPartitionDesc
@@ -1192,21 +1194,6 @@ get_partition_qual_relid(Oid relid)
}
/*
- * Append OIDs of rel's partitions to the list 'partoids' and for each OID,
- * append pointer rel to the list 'parents'.
- */
-#define APPEND_REL_PARTITION_OIDS(rel, partoids, parents) \
- do\
- {\
- int i;\
- for (i = 0; i < (rel)->rd_partdesc->nparts; i++)\
- {\
- (partoids) = lappend_oid((partoids), (rel)->rd_partdesc->oids[i]);\
- (parents) = lappend((parents), (rel));\
- }\
- } while(0)
-
-/*
* RelationGetPartitionDispatchInfo
* Returns information necessary to route tuples down a partition tree
*
@@ -1222,151 +1209,120 @@ PartitionDispatch *
RelationGetPartitionDispatchInfo(Relation rel,
int *num_parted, List **leaf_part_oids)
{
+ List *pdlist = NIL;
PartitionDispatchData **pd;
- List *all_parts = NIL,
- *all_parents = NIL,
- *parted_rels,
- *parted_rel_parents;
- ListCell *lc1,
- *lc2;
- int i,
- k,
- offset;
+ ListCell *lc;
+ int i;
- /*
- * We rely on the relcache to traverse the partition tree to build both
- * the leaf partition OIDs list and the array of PartitionDispatch objects
- * for the partitioned tables in the tree. That means every partitioned
- * table in the tree must be locked, which is fine since we require the
- * caller to lock all the partitions anyway.
- *
- * For every partitioned table in the tree, starting with the root
- * partitioned table, add its relcache entry to parted_rels, while also
- * queuing its partitions (in the order in which they appear in the
- * partition descriptor) to be looked at later in the same loop. This is
- * a bit tricky but works because the foreach() macro doesn't fetch the
- * next list element until the bottom of the loop.
- */
- *num_parted = 1;
- parted_rels = list_make1(rel);
- /* Root partitioned table has no parent, so NULL for parent */
- parted_rel_parents = list_make1(NULL);
- APPEND_REL_PARTITION_OIDS(rel, all_parts, all_parents);
- forboth(lc1, all_parts, lc2, all_parents)
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+
+ *num_parted = 0;
+ *leaf_part_oids = NIL;
+
+ get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
+ *num_parted = list_length(pdlist);
+ pd = (PartitionDispatchData **) palloc(*num_parted *
+ sizeof(PartitionDispatchData *));
+ i = 0;
+ foreach (lc, pdlist)
{
- Oid partrelid = lfirst_oid(lc1);
- Relation parent = lfirst(lc2);
+ pd[i++] = lfirst(lc);
+ }
- if (get_rel_relkind(partrelid) == RELKIND_PARTITIONED_TABLE)
- {
- /*
- * Already locked by the caller. Note that it is the
- * responsibility of the caller to close the below relcache entry,
- * once done using the information being collected here (for
- * example, in ExecEndModifyTable).
- */
- Relation partrel = heap_open(partrelid, NoLock);
+ return pd;
+}
- (*num_parted)++;
- parted_rels = lappend(parted_rels, partrel);
- parted_rel_parents = lappend(parted_rel_parents, parent);
- APPEND_REL_PARTITION_OIDS(partrel, all_parts, all_parents);
- }
+/*
+ * get_partition_dispatch_recurse
+ * Recursively expand partition tree rooted at rel
+ *
+ * As the partition tree is expanded in a depth-first manner, we maintain two
+ * global lists: of PartitionDispatch objects corresponding to partitioned
+ * tables in *pds and of the leaf partition OIDs in *leaf_part_oids.
+ */
+static void
+get_partition_dispatch_recurse(Relation rel, Relation parent,
+ List **pds, List **leaf_part_oids)
+{
+ TupleDesc tupdesc = RelationGetDescr(rel);
+ PartitionDesc partdesc = RelationGetPartitionDesc(rel);
+ PartitionKey partkey = RelationGetPartitionKey(rel);
+ PartitionDispatch pd;
+ int i;
+
+ /* Build a PartitionDispatch for this table and add it to *pds. */
+ pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
+ *pds = lappend(*pds, pd);
+ pd->reldesc = rel;
+ pd->key = partkey;
+ pd->keystate = NIL;
+ pd->partdesc = partdesc;
+ if (parent != NULL)
+ {
+ /*
+ * For every partitioned table other than root, we must store a
+ * tuple table slot initialized with its tuple descriptor and a
+ * tuple conversion map to convert a tuple from its parent's
+ * rowtype to its own. That is to make sure that we are looking at
+ * the correct row using the correct tuple descriptor when
+ * computing its partition key for tuple routing.
+ */
+ pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
+ pd->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
+ tupdesc,
+ gettext_noop("could not convert row type"));
+ }
+ else
+ {
+ /* Not required for the root partitioned table */
+ pd->tupslot = NULL;
+ pd->tupmap = NULL;
}
/*
- * We want to create two arrays - one for leaf partitions and another for
- * partitioned tables (including the root table and internal partitions).
- * While we only create the latter here, leaf partition array of suitable
- * objects (such as, ResultRelInfo) is created by the caller using the
- * list of OIDs we return. Indexes into these arrays get assigned in a
- * breadth-first manner, whereby partitions of any given level are placed
- * consecutively in the respective arrays.
+ * Go look at each partition of this table. If it's a leaf partition,
+ * simply add its OID to *leaf_part_oids. If it's a partitioned table,
+ * recursively call get_partition_dispatch_recurse(), so that its
+ * partitions are processed as well and a corresponding PartitionDispatch
+ * object gets added to *pds.
+ *
+ * About the values in pd->indexes: for a leaf partition, it contains the
+ * leaf partition's position in the global list *leaf_part_oids minus 1,
+ * whereas for a partitioned table partition, it contains the partition's
+ * position in the global list *pds multiplied by -1. The latter is
+ * multiplied by -1 to distinguish partitioned tables from leaf partitions
+ * when going through the values in pd->indexes. So, for example, when
+ * using it during tuple-routing, encountering a value >= 0 means we found
+ * a leaf partition. It is immediately returned as the index in the array
+ * of ResultRelInfos of all the leaf partitions, using which we insert the
+ * tuple into that leaf partition. A negative value means we found a
+ * partitioned table. The value multiplied back by -1 is returned as the
+ * index in the array of PartitionDispatch objects of all partitioned
+ * tables in the tree, using which, search is continued further down the
+ * partition tree.
*/
- pd = (PartitionDispatchData **) palloc(*num_parted *
- sizeof(PartitionDispatchData *));
- *leaf_part_oids = NIL;
- i = k = offset = 0;
- forboth(lc1, parted_rels, lc2, parted_rel_parents)
+ pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
+ for (i = 0; i < partdesc->nparts; i++)
{
- Relation partrel = lfirst(lc1);
- Relation parent = lfirst(lc2);
- PartitionKey partkey = RelationGetPartitionKey(partrel);
- TupleDesc tupdesc = RelationGetDescr(partrel);
- PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
- int j,
- m;
-
- pd[i] = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
- pd[i]->reldesc = partrel;
- pd[i]->key = partkey;
- pd[i]->keystate = NIL;
- pd[i]->partdesc = partdesc;
- if (parent != NULL)
+ Oid partrelid = partdesc->oids[i];
+
+ if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
{
- /*
- * For every partitioned table other than root, we must store a
- * tuple table slot initialized with its tuple descriptor and a
- * tuple conversion map to convert a tuple from its parent's
- * rowtype to its own. That is to make sure that we are looking at
- * the correct row using the correct tuple descriptor when
- * computing its partition key for tuple routing.
- */
- pd[i]->tupslot = MakeSingleTupleTableSlot(tupdesc);
- pd[i]->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
- tupdesc,
- gettext_noop("could not convert row type"));
+ *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
+ pd->indexes[i] = list_length(*leaf_part_oids) - 1;
}
else
{
- /* Not required for the root partitioned table */
- pd[i]->tupslot = NULL;
- pd[i]->tupmap = NULL;
- }
- pd[i]->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
-
- /*
- * Indexes corresponding to the internal partitions are multiplied by
- * -1 to distinguish them from those of leaf partitions. Encountering
- * an index >= 0 means we found a leaf partition, which is immediately
- * returned as the partition we are looking for. A negative index
- * means we found a partitioned table, whose PartitionDispatch object
- * is located at the above index multiplied back by -1. Using the
- * PartitionDispatch object, search is continued further down the
- * partition tree.
- */
- m = 0;
- for (j = 0; j < partdesc->nparts; j++)
- {
- Oid partrelid = partdesc->oids[j];
+ /*
+ * We assume all tables in the partition tree were already
+ * locked by the caller.
+ */
+ Relation partrel = heap_open(partrelid, NoLock);
- if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
- {
- *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
- pd[i]->indexes[j] = k++;
- }
- else
- {
- /*
- * offset denotes the number of partitioned tables of upper
- * levels including those of the current level. Any partition
- * of this table must belong to the next level and hence will
- * be placed after the last partitioned table of this level.
- */
- pd[i]->indexes[j] = -(1 + offset + m);
- m++;
- }
+ pd->indexes[i] = -list_length(*pds);
+ get_partition_dispatch_recurse(partrel, rel, pds, leaf_part_oids);
}
- i++;
-
- /*
- * This counts the number of partitioned tables at upper levels
- * including those of the current level.
- */
- offset += m;
}
-
- return pd;
}
/* Module-local functions */
--
2.11.0
On 13 September 2017 at 15:32, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2017/09/11 18:56, Amit Langote wrote:
Attached updated patch does it that way for both partitioned table indexes
and leaf partition indexes. Thanks for pointing it out.
It seems to me we don't really need the first patch all that much. That
is, let's keep PartitionDispatchData the way it is for now, since we don't
really have any need for it beside tuple-routing (EIBO as committed didn't
need it for one). So, let's forget about "decoupling
RelationGetPartitionDispatchInfo() from the executor" thing for now and
move on.
So, attached is just the patch to make RelationGetPartitionDispatchInfo()
traverse the partition tree in depth-first manner to be applied on HEAD.
Thoughts?
+1. If at all we need the decoupling later for some reason, we can do
that incrementally.
Will review your latest patch by tomorrow.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Sep 13, 2017 at 6:02 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
It seems to me we don't really need the first patch all that much. That
is, let's keep PartitionDispatchData the way it is for now, since we don't
really have any need for it beside tuple-routing (EIBO as committed didn't
need it for one). So, let's forget about "decoupling
RelationGetPartitionDispatchInfo() from the executor" thing for now and
move on.
So, attached is just the patch to make RelationGetPartitionDispatchInfo()
traverse the partition tree in depth-first manner to be applied on HEAD.
I like this patch. Not only does it improve the behavior, but it's
actually less code than we have now, and in my opinion, the new code
is easier to understand, too.
A few suggestions:
- I think get_partition_dispatch_recurse() should get a check_stack_depth()
call just like expand_partitioned_rtentry() did, and for the same
reasons: if the catalog contents are corrupted so that we have an
infinite loop in the partitioning hierarchy, we want to error out, not
crash.
- I think we should add a comment explaining that we're careful to do
this in the same order as expand_partitioned_rtentry() so that callers
can assume that the N'th entry in the leaf_part_oids array will also
be the N'th child of an Append node.
+ * For every partitioned table other than root, we must store a
other than the root
+ * partitioned table. The value multiplied back by -1 is returned as the
multiplied by -1, not multiplied back by -1
+ * tables in the tree, using which, search is continued further down the
+ * partition tree.
Period after "in the tree". Then continue: "This value is used to
continue the search in the next level of the partition tree."
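As an illustration of the check_stack_depth() point, here is a small sketch of why a guard matters once the traversal becomes recursive. It does not use PostgreSQL's mechanism, which measures actual stack usage against max_stack_depth and raises an error. Instead, a plain depth counter with a made-up limit (TOY_MAX_DEPTH) stands in for it, and ToyNode and toy_count_guarded() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical limit; a stand-in for a real stack-usage check. */
#define TOY_MAX_DEPTH 64

typedef struct ToyNode
{
    int         nchildren;
    struct ToyNode *children[4];
} ToyNode;

/*
 * Count the nodes reachable from node. Returns -1 instead of recursing
 * without bound if the hierarchy is deeper than TOY_MAX_DEPTH, which is
 * what happens when corrupted links form a cycle.
 */
static int
toy_count_guarded(const ToyNode *node, int depth)
{
    int         total = 1;
    int         i;

    assert(node != NULL);

    if (depth > TOY_MAX_DEPTH)
        return -1;              /* report an error rather than crash */

    for (i = 0; i < node->nchildren; i++)
    {
        int         sub = toy_count_guarded(node->children[i], depth + 1);

        if (sub < 0)
            return -1;          /* propagate the failure upward */
        total += sub;
    }
    return total;
}
```

With a well-formed three-node tree the function returns 3. With two nodes that list each other as their only child it returns -1 after a bounded number of calls, rather than overflowing the stack.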
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017/09/14 1:42, Robert Haas wrote:
On Wed, Sep 13, 2017 at 6:02 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
It seems to me we don't really need the first patch all that much. That
is, let's keep PartitionDispatchData the way it is for now, since we don't
really have any need for it beside tuple-routing (EIBO as committed didn't
need it for one). So, let's forget about "decoupling
RelationGetPartitionDispatchInfo() from the executor" thing for now and
move on.
So, attached is just the patch to make RelationGetPartitionDispatchInfo()
traverse the partition tree in depth-first manner to be applied on HEAD.
I like this patch. Not only does it improve the behavior, but it's
actually less code than we have now, and in my opinion, the new code
is easier to understand, too.
A few suggestions:
Thanks for the review.
- I think get_partition_dispatch_recurse() should get a check_stack_depth()
call just like expand_partitioned_rtentry() did, and for the same
reasons: if the catalog contents are corrupted so that we have an
infinite loop in the partitioning hierarchy, we want to error out, not
crash.
Ah, missed that. Done.
- I think we should add a comment explaining that we're careful to do
this in the same order as expand_partitioned_rtentry() so that callers
can assume that the N'th entry in the leaf_part_oids array will also
be the N'th child of an Append node.
Done. Since the Append/ModifyTable may skip some leaf partitions due to
pruning, added a note about that too.
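A rough sketch of what "skip the corresponding entries" could look like for a caller follows. It assumes that both sequences are in the same depth-first order and that the planner's children form a subsequence of the leaf OIDs. The function toy_match_subplans() and its array-based interface are hypothetical and only illustrate the pairing. Nothing like it is in the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * For each leaf OID (in expansion order), store the position of the
 * matching subplan in leaf_to_subplan[], or -1 if that leaf was pruned.
 * Returns the number of subplans matched, or -1 if subplan_oids is not an
 * in-order subsequence of leaf_oids (that is, the two orders disagree).
 */
static int
toy_match_subplans(const int *leaf_oids, int nleaves,
                   const int *subplan_oids, int nsubplans,
                   int *leaf_to_subplan)
{
    int         i;
    int         j = 0;          /* next unmatched subplan */

    assert(leaf_oids != NULL && leaf_to_subplan != NULL);

    for (i = 0; i < nleaves; i++)
    {
        if (j < nsubplans && subplan_oids[j] == leaf_oids[i])
            leaf_to_subplan[i] = j++;   /* same order: pair them up */
        else
            leaf_to_subplan[i] = -1;    /* pruned by the planner */
    }

    return (j == nsubplans) ? j : -1;
}
```

Because both sides share one order, a single forward pass is enough and no per-OID lookup is needed. If the orders ever diverged, the pass would leave some subplans unmatched, which the return value reports.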
+ * For every partitioned table other than root, we must store a
other than the root
+ * partitioned table. The value multiplied back by -1 is returned as the
multiplied by -1, not multiplied back by -1
+ * tables in the tree, using which, search is continued further down the
+ * partition tree.
Period after "in the tree". Then continue: "This value is used to
continue the search in the next level of the partition tree."
Fixed.
Attached updated patch.
Thanks,
Amit
Attachments:
0001-Make-RelationGetPartitionDispatch-expansion-order-de.patch (text/plain; charset=UTF-8)
From c2599d52267cc362e059452efe69ddd09b94c083 Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Fri, 8 Sep 2017 17:35:10 +0900
Subject: [PATCH] Make RelationGetPartitionDispatch expansion order depth-first
This makes the expansion order match the planner's expansion of the
partition inheritance hierarchy. Matching the planner's order makes it
easier to pair the executor's per-partition objects with the
planner-created per-partition nodes.
---
src/backend/catalog/partition.c | 252 +++++++++++++++++-----------------------
1 file changed, 109 insertions(+), 143 deletions(-)
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 73eff17202..36f52ddb98 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -147,6 +147,8 @@ static int32 partition_bound_cmp(PartitionKey key,
static int partition_bound_bsearch(PartitionKey key,
PartitionBoundInfo boundinfo,
void *probe, bool probe_is_bound, bool *is_equal);
+static void get_partition_dispatch_recurse(Relation rel, Relation parent,
+ List **pds, List **leaf_part_oids);
/*
* RelationBuildPartitionDesc
@@ -1192,21 +1194,6 @@ get_partition_qual_relid(Oid relid)
}
/*
- * Append OIDs of rel's partitions to the list 'partoids' and for each OID,
- * append pointer rel to the list 'parents'.
- */
-#define APPEND_REL_PARTITION_OIDS(rel, partoids, parents) \
- do\
- {\
- int i;\
- for (i = 0; i < (rel)->rd_partdesc->nparts; i++)\
- {\
- (partoids) = lappend_oid((partoids), (rel)->rd_partdesc->oids[i]);\
- (parents) = lappend((parents), (rel));\
- }\
- } while(0)
-
-/*
* RelationGetPartitionDispatchInfo
* Returns information necessary to route tuples down a partition tree
*
@@ -1222,151 +1209,130 @@ PartitionDispatch *
RelationGetPartitionDispatchInfo(Relation rel,
int *num_parted, List **leaf_part_oids)
{
+ List *pdlist = NIL;
PartitionDispatchData **pd;
- List *all_parts = NIL,
- *all_parents = NIL,
- *parted_rels,
- *parted_rel_parents;
- ListCell *lc1,
- *lc2;
- int i,
- k,
- offset;
+ ListCell *lc;
+ int i;
- /*
- * We rely on the relcache to traverse the partition tree to build both
- * the leaf partition OIDs list and the array of PartitionDispatch objects
- * for the partitioned tables in the tree. That means every partitioned
- * table in the tree must be locked, which is fine since we require the
- * caller to lock all the partitions anyway.
- *
- * For every partitioned table in the tree, starting with the root
- * partitioned table, add its relcache entry to parted_rels, while also
- * queuing its partitions (in the order in which they appear in the
- * partition descriptor) to be looked at later in the same loop. This is
- * a bit tricky but works because the foreach() macro doesn't fetch the
- * next list element until the bottom of the loop.
- */
- *num_parted = 1;
- parted_rels = list_make1(rel);
- /* Root partitioned table has no parent, so NULL for parent */
- parted_rel_parents = list_make1(NULL);
- APPEND_REL_PARTITION_OIDS(rel, all_parts, all_parents);
- forboth(lc1, all_parts, lc2, all_parents)
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+
+ *num_parted = 0;
+ *leaf_part_oids = NIL;
+
+ get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
+ *num_parted = list_length(pdlist);
+ pd = (PartitionDispatchData **) palloc(*num_parted *
+ sizeof(PartitionDispatchData *));
+ i = 0;
+ foreach (lc, pdlist)
{
- Oid partrelid = lfirst_oid(lc1);
- Relation parent = lfirst(lc2);
+ pd[i++] = lfirst(lc);
+ }
- if (get_rel_relkind(partrelid) == RELKIND_PARTITIONED_TABLE)
- {
- /*
- * Already locked by the caller. Note that it is the
- * responsibility of the caller to close the below relcache entry,
- * once done using the information being collected here (for
- * example, in ExecEndModifyTable).
- */
- Relation partrel = heap_open(partrelid, NoLock);
+ return pd;
+}
- (*num_parted)++;
- parted_rels = lappend(parted_rels, partrel);
- parted_rel_parents = lappend(parted_rel_parents, parent);
- APPEND_REL_PARTITION_OIDS(partrel, all_parts, all_parents);
- }
+/*
+ * get_partition_dispatch_recurse
+ * Recursively expand partition tree rooted at rel
+ *
+ * As the partition tree is expanded in a depth-first manner, we maintain two
+ * global lists: of PartitionDispatch objects corresponding to partitioned
+ * tables in *pds and of the leaf partition OIDs in *leaf_part_oids.
+ *
+ * Note that the order of OIDs of leaf partitions in leaf_part_oids matches
+ * the order in which the planner's expand_partitioned_rtentry() processes
+ * them. So, the N'th entry in leaf_part_oids will correspond to the N'th
+ * child of the Append/ModifyTable node for rel, provided the latter contains
+ * all leaf partitions of rel. If the latter skips some leaf partitions,
+ * because they were pruned by the planner, simply skip the corresponding
+ * entries from leaf_part_oids.
+ */
+static void
+get_partition_dispatch_recurse(Relation rel, Relation parent,
+ List **pds, List **leaf_part_oids)
+{
+ TupleDesc tupdesc = RelationGetDescr(rel);
+ PartitionDesc partdesc = RelationGetPartitionDesc(rel);
+ PartitionKey partkey = RelationGetPartitionKey(rel);
+ PartitionDispatch pd;
+ int i;
+
+ check_stack_depth();
+
+ /* Build a PartitionDispatch for this table and add it to *pds. */
+ pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
+ *pds = lappend(*pds, pd);
+ pd->reldesc = rel;
+ pd->key = partkey;
+ pd->keystate = NIL;
+ pd->partdesc = partdesc;
+ if (parent != NULL)
+ {
+ /*
+ * For every partitioned table other than the root, we must store a
+ * tuple table slot initialized with its tuple descriptor and a
+ * tuple conversion map to convert a tuple from its parent's
+ * rowtype to its own. That is to make sure that we are looking at
+ * the correct row using the correct tuple descriptor when
+ * computing its partition key for tuple routing.
+ */
+ pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
+ pd->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
+ tupdesc,
+ gettext_noop("could not convert row type"));
+ }
+ else
+ {
+ /* Not required for the root partitioned table */
+ pd->tupslot = NULL;
+ pd->tupmap = NULL;
}
/*
- * We want to create two arrays - one for leaf partitions and another for
- * partitioned tables (including the root table and internal partitions).
- * While we only create the latter here, leaf partition array of suitable
- * objects (such as, ResultRelInfo) is created by the caller using the
- * list of OIDs we return. Indexes into these arrays get assigned in a
- * breadth-first manner, whereby partitions of any given level are placed
- * consecutively in the respective arrays.
+ * Go look at each partition of this table. If it's a leaf partition,
+ * simply add its OID to *leaf_part_oids. If it's a partitioned table,
+ * recursively call get_partition_dispatch_recurse(), so that its
+ * partitions are processed as well and a corresponding PartitionDispatch
+ * object gets added to *pds.
+ *
+ * About the values in pd->indexes: for a leaf partition, it contains the
+ * leaf partition's position in the global list *leaf_part_oids minus 1,
+ * whereas for a partitioned table partition, it contains the partition's
+ * position in the global list *pds multiplied by -1. The latter is
+ * multiplied by -1 to distinguish partitioned tables from leaf partitions
+ * when going through the values in pd->indexes. So, for example, when
+ * using it during tuple-routing, encountering a value >= 0 means we found
+ * a leaf partition. It is immediately returned as the index in the array
+ * of ResultRelInfos of all the leaf partitions, using which we insert the
+ * tuple into that leaf partition. A negative value means we found a
+ * partitioned table. The value multiplied by -1 is returned as the index
+ * in the array of PartitionDispatch objects of all partitioned tables in
+ * the tree. This value is used to continue the search in the next level
+ * of the partition tree.
*/
- pd = (PartitionDispatchData **) palloc(*num_parted *
- sizeof(PartitionDispatchData *));
- *leaf_part_oids = NIL;
- i = k = offset = 0;
- forboth(lc1, parted_rels, lc2, parted_rel_parents)
+ pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
+ for (i = 0; i < partdesc->nparts; i++)
{
- Relation partrel = lfirst(lc1);
- Relation parent = lfirst(lc2);
- PartitionKey partkey = RelationGetPartitionKey(partrel);
- TupleDesc tupdesc = RelationGetDescr(partrel);
- PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
- int j,
- m;
-
- pd[i] = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
- pd[i]->reldesc = partrel;
- pd[i]->key = partkey;
- pd[i]->keystate = NIL;
- pd[i]->partdesc = partdesc;
- if (parent != NULL)
+ Oid partrelid = partdesc->oids[i];
+
+ if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
{
- /*
- * For every partitioned table other than root, we must store a
- * tuple table slot initialized with its tuple descriptor and a
- * tuple conversion map to convert a tuple from its parent's
- * rowtype to its own. That is to make sure that we are looking at
- * the correct row using the correct tuple descriptor when
- * computing its partition key for tuple routing.
- */
- pd[i]->tupslot = MakeSingleTupleTableSlot(tupdesc);
- pd[i]->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
- tupdesc,
- gettext_noop("could not convert row type"));
+ *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
+ pd->indexes[i] = list_length(*leaf_part_oids) - 1;
}
else
{
- /* Not required for the root partitioned table */
- pd[i]->tupslot = NULL;
- pd[i]->tupmap = NULL;
- }
- pd[i]->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
-
- /*
- * Indexes corresponding to the internal partitions are multiplied by
- * -1 to distinguish them from those of leaf partitions. Encountering
- * an index >= 0 means we found a leaf partition, which is immediately
- * returned as the partition we are looking for. A negative index
- * means we found a partitioned table, whose PartitionDispatch object
- * is located at the above index multiplied back by -1. Using the
- * PartitionDispatch object, search is continued further down the
- * partition tree.
- */
- m = 0;
- for (j = 0; j < partdesc->nparts; j++)
- {
- Oid partrelid = partdesc->oids[j];
+ /*
+ * We assume all tables in the partition tree were already
+ * locked by the caller.
+ */
+ Relation partrel = heap_open(partrelid, NoLock);
- if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
- {
- *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
- pd[i]->indexes[j] = k++;
- }
- else
- {
- /*
- * offset denotes the number of partitioned tables of upper
- * levels including those of the current level. Any partition
- * of this table must belong to the next level and hence will
- * be placed after the last partitioned table of this level.
- */
- pd[i]->indexes[j] = -(1 + offset + m);
- m++;
- }
+ pd->indexes[i] = -list_length(*pds);
+ get_partition_dispatch_recurse(partrel, rel, pds, leaf_part_oids);
}
- i++;
-
- /*
- * This counts the number of partitioned tables at upper levels
- * including those of the current level.
- */
- offset += m;
}
-
- return pd;
}
/* Module-local functions */
--
2.11.0
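The pd->indexes convention described in the patch comments can be illustrated with a small standalone sketch. Everything here (ToyRel, ToyDispatch, ToyState, toy_expand, toy_route, the fixed capacity of 16) is a hypothetical stand-in invented for illustration and is not PostgreSQL code. It assumes a node with zero children is a leaf, which is a simplification. The sketch only mirrors the idea: expand depth-first, give each leaf its position in the leaf list, and give each partitioned child the negated position its dispatch entry will occupy.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_MAX 16				/* fixed capacity keeps the sketch short */

/* Toy stand-in for a relation: NOT a PostgreSQL type. */
typedef struct ToyRel
{
	int			id;				/* stands in for a relation OID */
	int			nparts;			/* 0 means leaf partition (simplification) */
	struct ToyRel **parts;		/* children, in partition bound order */
} ToyRel;

/* Toy stand-in for PartitionDispatchData. */
typedef struct ToyDispatch
{
	const ToyRel *rel;
	int		   *indexes;		/* >= 0: leaf slot; < 0: -(dispatch slot) */
} ToyDispatch;

typedef struct ToyState
{
	ToyDispatch pds[TOY_MAX];	/* partitioned tables, root at slot 0 */
	int			npds;
	int			leaf_ids[TOY_MAX];	/* leaf ids in depth-first order */
	int			nleaves;
} ToyState;

/* Depth-first expansion of the tree rooted at rel. */
static void
toy_expand(const ToyRel *rel, ToyState *st)
{
	ToyDispatch *pd;
	int			i;

	assert(st->npds < TOY_MAX);
	pd = &st->pds[st->npds++];
	pd->rel = rel;
	pd->indexes = malloc(rel->nparts * sizeof(int));

	for (i = 0; i < rel->nparts; i++)
	{
		const ToyRel *child = rel->parts[i];

		if (child->nparts == 0)
		{
			/* Leaf: record its id and its position in the leaf array. */
			assert(st->nleaves < TOY_MAX);
			st->leaf_ids[st->nleaves] = child->id;
			pd->indexes[i] = st->nleaves;
			st->nleaves++;
		}
		else
		{
			/*
			 * Partitioned child: its dispatch entry will take slot npds.
			 * The root owns slot 0, so the negated value is never 0.
			 */
			pd->indexes[i] = -st->npds;
			toy_expand(child, st);
		}
	}
}

/* Follow one child position per level until a leaf slot is reached. */
static int
toy_route(const ToyState *st, const int *choice)
{
	const ToyDispatch *pd = &st->pds[0];
	int			level = 0;

	for (;;)
	{
		int			idx = pd->indexes[choice[level++]];

		if (idx >= 0)
			return idx;			/* index into leaf_ids */
		pd = &st->pds[-idx];	/* descend to the next level */
	}
}
```

To try it, build a small tree of ToyRel values, zero-initialize a ToyState, call toy_expand() on the root, and pass toy_route() an array holding the child position chosen at each level. In the real code the position at each level comes from a search over the partition bounds, which this sketch leaves out.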
On 14 September 2017 at 06:43, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
Attached updated patch.
@@ -1222,151 +1209,130 @@ PartitionDispatch *
RelationGetPartitionDispatchInfo(Relation rel,
int *num_parted, List **leaf_part_oids)
{
+ List *pdlist;
PartitionDispatchData **pd;
+ get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
Above, pdlist is passed uninitialized. And then inside
get_partition_dispatch_recurse(), it is used here :
*pds = lappend(*pds, pd);
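A minimal sketch of why this matters, using a toy list rather than PostgreSQL's List: ToyCell, TOY_NIL, toy_lappend, toy_length and toy_collect are hypothetical names made up for illustration. As in PostgreSQL, where NIL is a NULL list pointer, the append routine inspects the incoming list value, so a callee that appends through a pointer needs the caller's variable to start out as the empty list. The presumable fix in the patch is simply to initialize pdlist to NIL before the call.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy singly linked list; the empty list is a NULL pointer. */
typedef struct ToyCell
{
	void	   *ptr;
	struct ToyCell *next;
} ToyCell;

#define TOY_NIL ((ToyCell *) NULL)

/* Append datum and return the (possibly new) list head. */
static ToyCell *
toy_lappend(ToyCell *list, void *datum)
{
	ToyCell    *cell = malloc(sizeof(ToyCell));
	ToyCell    *tail;

	cell->ptr = datum;
	cell->next = NULL;

	/* This test reads the caller's value: garbage here is undefined behavior. */
	if (list == TOY_NIL)
		return cell;

	for (tail = list; tail->next != NULL; tail = tail->next)
		;
	tail->next = cell;
	return list;
}

static int
toy_length(const ToyCell *list)
{
	int			n = 0;

	for (; list != NULL; list = list->next)
		n++;
	return n;
}

/*
 * Callee that appends through a pointer, like the recursive function does
 * with *pds. It relies on *out already being a valid list.
 */
static void
toy_collect(ToyCell **out, void *datum)
{
	*out = toy_lappend(*out, datum);
}
```

A caller would declare its list as `ToyCell *pdlist = TOY_NIL;` and then pass `&pdlist` to toy_collect(). Leaving out the initializer compiles but makes the first append read an indeterminate pointer.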
--------
pg_indent says more alignment is needed. For example, the gettext_noop() call
below needs to be aligned:
pd->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
tupdesc,
gettext_noop("could not convert row type"));
--------
Other than that, the patch looks good to me. I verified that the leaf
oids are ordered exactly in the order of the UPDATE subplans output.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Sep 14, 2017 at 7:56 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On 14 September 2017 at 06:43, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
Attached updated patch.
@@ -1222,151 +1209,130 @@ PartitionDispatch *
RelationGetPartitionDispatchInfo(Relation rel,
int *num_parted, List **leaf_part_oids)
{
+ List *pdlist;
PartitionDispatchData **pd;
+ get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
Above, pdlist is passed uninitialized. And then inside
get_partition_dispatch_recurse(), it is used here :
*pds = lappend(*pds, pd);
--------
pg_indent says more alignment is needed. For example, the gettext_noop() call
below needs to be aligned:
pd->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
tupdesc,
gettext_noop("could not convert row type"));
--------
Other than that, the patch looks good to me. I verified that the leaf
oids are ordered exactly in the order of the UPDATE subplans output.
Committed with fixes for those issues and a few other cosmetic changes.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017/09/15 1:37, Robert Haas wrote:
Committed with fixes for those issues and a few other cosmetic changes.
Thanks Amit for the review and Robert for committing.
Regards,
Amit