expanding inheritance in partition bound order
The current way to expand inherited tables, including partitioned tables,
is to use either find_all_inheritors() or find_inheritance_children()
depending on the context. They return child table OIDs in the (ascending)
order of those OIDs, which means the callers that need to lock the child
tables can do so without worrying about the possibility of deadlock in
some concurrent execution of that piece of code. That's good.
For partitioned tables, there is a possibility of returning child table
(partition) OIDs in the partition bound order, which in addition to
preventing the possibility of deadlocks during concurrent locking, seems
potentially useful for other caller-specific optimizations. For example,
tuple-routing code can utilize that fact to implement binary-search based
partition-searching algorithm. For one more example, refer to the "UPDATE
partition key" thread where it's becoming clear that it would be nice if
the planner had put the partitions in bound order in the ModifyTable that
it creates for UPDATE of partitioned tables [1]/messages/by-id/CA+TgmoajC0J50=2FqnZLvB10roY+68HgFWhso=V_StkC6PWujQ@mail.gmail.com.
So attached are two WIP patches:
0001 implements two interface functions:
List *get_all_partition_oids(Oid, LOCKMODE)
List *get_partition_oids(Oid, LOCKMODE)
that resemble find_all_inheritors() and find_inheritance_children(),
respectively, but expect that users call them only for partitioned tables.
Needless to mention, OIDs are returned with canonical order determined by
that of the partition bounds and they way partition tree structure is
traversed (top-down, breadth-first-left-to-right). Patch replaces all the
calls of the old interface functions with the respective new ones for
partitioned table parents. That means expand_inherited_rtentry (among
others) now calls get_all_partition_oids() if the RTE is for a partitioned
table and find_all_inheritors() otherwise.
In its implementation, get_all_partition_oids() calls
RelationGetPartitionDispatchInfo(), which is useful to generate the result
list in the desired partition bound order. But the current interface and
implementation of RelationGetPartitionDispatchInfo() needs some rework,
because it's too closely coupled with the executor's tuple routing code.
Applying just 0001 will satisfy the requirements stated in [1]/messages/by-id/CA+TgmoajC0J50=2FqnZLvB10roY+68HgFWhso=V_StkC6PWujQ@mail.gmail.com, but it
won't look pretty as is for too long.
So, 0002 is a patch to refactor RelationGetPartitionDispatchInfo() and
relevant data structures. For example, PartitionDispatchData has now been
simplified to contain only the partition key, partition descriptor and
indexes array, whereas previously it also stored the relation descriptor,
partition key execution state, tuple table slot, tuple conversion map
which are required for tuple-routing. RelationGetPartitionDispatchInfo()
no longer generates those things, but returns just enough information so
that a caller can generate and manage those things by itself. This
simplification makes it less cumbersome to call
RelationGetPartitionDispatchInfo() in other places.
Thanks,
Amit
[1]: /messages/by-id/CA+TgmoajC0J50=2FqnZLvB10roY+68HgFWhso=V_StkC6PWujQ@mail.gmail.com
/messages/by-id/CA+TgmoajC0J50=2FqnZLvB10roY+68HgFWhso=V_StkC6PWujQ@mail.gmail.com
Attachments:
0001-Add-get_all_partition_oids-and-get_partition_oids-v1.patchtext/plain; charset=UTF-8; name=0001-Add-get_all_partition_oids-and-get_partition_oids-v1.patchDownload+213-39
0002-Decouple-RelationGetPartitionDispatchInfo-from-execu-v1.patchtext/plain; charset=UTF-8; name=0002-Decouple-RelationGetPartitionDispatchInfo-from-execu-v1.patchDownload+426-273
On Fri, Aug 4, 2017 at 1:08 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
The current way to expand inherited tables, including partitioned tables,
is to use either find_all_inheritors() or find_inheritance_children()
depending on the context. They return child table OIDs in the (ascending)
order of those OIDs, which means the callers that need to lock the child
tables can do so without worrying about the possibility of deadlock in
some concurrent execution of that piece of code. That's good.For partitioned tables, there is a possibility of returning child table
(partition) OIDs in the partition bound order, which in addition to
preventing the possibility of deadlocks during concurrent locking, seems
potentially useful for other caller-specific optimizations. For example,
tuple-routing code can utilize that fact to implement binary-search based
partition-searching algorithm. For one more example, refer to the "UPDATE
partition key" thread where it's becoming clear that it would be nice if
the planner had put the partitions in bound order in the ModifyTable that
it creates for UPDATE of partitioned tables [1].
Thanks a lot for working on this. Partition-wise join can benefit from
this as well. See comment about build_simple_rel's matching algorithm
in [1]/messages/by-id/CA+TgmobeRUTu4osXA_UA4AORho83WxAvFG8n1NQcoFuujbeh7A@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company. It will become O(n) instead of O(n^2).
So attached are two WIP patches:
0001 implements two interface functions:
List *get_all_partition_oids(Oid, LOCKMODE)
List *get_partition_oids(Oid, LOCKMODE)that resemble find_all_inheritors() and find_inheritance_children(),
respectively, but expect that users call them only for partitioned tables.
Needless to mention, OIDs are returned with canonical order determined by
that of the partition bounds and they way partition tree structure is
traversed (top-down, breadth-first-left-to-right). Patch replaces all the
calls of the old interface functions with the respective new ones for
partitioned table parents. That means expand_inherited_rtentry (among
others) now calls get_all_partition_oids() if the RTE is for a partitioned
table and find_all_inheritors() otherwise.In its implementation, get_all_partition_oids() calls
RelationGetPartitionDispatchInfo(), which is useful to generate the result
list in the desired partition bound order. But the current interface and
implementation of RelationGetPartitionDispatchInfo() needs some rework,
because it's too closely coupled with the executor's tuple routing code.
May be we want to implement get_all_partition_oids() calling
get_partition_oids() on each new entry that gets added, similar to
find_all_inheritors(). That might avoid changes to DispatchInfo() and
also dependency on that structure.
Also almost every place which called find_all_inheritors() or
find_inheritance_children() is changed to if () else case calling
those functions or the new function as required. May be we should
create macros/functions to do that routing to keep the code readable.
May be find_all_inheritors() and find_inheritance_children()
themselves become the routing function and their original code moves
into new functions get_all_inheritors() and
get_inheritance_children(). We may choose other names for functions.
The idea is to create routing functions/macros instead of sprinkling
code with if () else conditions.
[1]: /messages/by-id/CA+TgmobeRUTu4osXA_UA4AORho83WxAvFG8n1NQcoFuujbeh7A@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Aug 4, 2017 at 3:38 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
The current way to expand inherited tables, including partitioned tables,
is to use either find_all_inheritors() or find_inheritance_children()
depending on the context. They return child table OIDs in the (ascending)
order of those OIDs, which means the callers that need to lock the child
tables can do so without worrying about the possibility of deadlock in
some concurrent execution of that piece of code. That's good.For partitioned tables, there is a possibility of returning child table
(partition) OIDs in the partition bound order, which in addition to
preventing the possibility of deadlocks during concurrent locking, seems
potentially useful for other caller-specific optimizations. For example,
tuple-routing code can utilize that fact to implement binary-search based
partition-searching algorithm. For one more example, refer to the "UPDATE
partition key" thread where it's becoming clear that it would be nice if
the planner had put the partitions in bound order in the ModifyTable that
it creates for UPDATE of partitioned tables [1].
I guess I don't really understand why we want to change the locking
order. That is bound to make expanding the inheritance hierarchy more
expensive. If we use this approach in all cases, it seems to me we're
bound to reintroduce the problem we fixed in commit
c1e0e7e1d790bf18c913e6a452dea811e858b554 and maybe add a few more in
the same vein. But I don't see that there's any necessary relation
between the order of locking and the order of expansion inside the
relcache entry/plan/whatever else -- so I propose that we keep the
existing locking order and only change the other stuff.
While reading related code this morning, I noticed that
RelationBuildPartitionDesc and RelationGetPartitionDispatchInfo have
*already* changed the locking order for certain operations, because
the PartitionDesc's OID array is bound-ordered not OID-ordered. That
means that when RelationGetPartitionDispatchInfo uses the
PartitionDesc's OID arra to figure out what to lock, it's potentially
going to encounter partitions in a different order than would have
been the case if it had used find_all_inheritors directly. I'm
tempted to think that RelationGetPartitionDispatchInfo shouldn't
really be doing locking at all. The best way to have the locking
always happen in the same order is to have only one piece of code that
determines that order - and I vote for find_all_inheritors. Aside
from the fact that it's the way we've always done it (and still do it
in most other places), that code includes infinite-loop defenses which
the new code you've introduced lacks.
Concretely, my proposal is:
1. Before calling RelationGetPartitionDispatchInfo, the calling code
should use find_all_inheritors to lock all the relevant relations (or
the planner could use find_all_inheritors to get a list of relation
OIDs, store it in the plan in order, and then at execution time we
visit them in that order and lock them).
2. RelationGetPartitionDispatchInfo assumes the relations are already locked.
3. While we're optimizing, in the first loop inside of
RelationGetPartitionDispatchInfo, don't call heap_open(). Instead,
use get_rel_relkind() to see whether we've got a partitioned table; if
so, open it. If not, there's no need.
4. For safety, add a function bool RelationLockHeldByMe(Oid) and add
to this loop a check if (!RelationLockHeldByMe(lfirst_oid(lc1))
elog(ERROR, ...). Might be interesting to stuff that check into the
relation_open(..., NoLock) path, too.
One objection to this line of attack is that there might be a good
case for locking only the partitioned inheritors first and then going
back and locking the leaf nodes in a second pass, or even only when
required for a particular row. However, that doesn't require putting
everything in bound order - it only requires moving the partitioned
children to the beginning of the list. And I think rather than having
new logic for that, we should teach find_inheritance_children() to do
that directly. I have a feeling Ashutosh is going to cringe at this
suggestion, but my idea is to do this by denormalizing: add a column
to pg_inherits indicating whether the child is of
RELKIND_PARTITIONED_TABLE. Then, when find_inheritance_children scans
pg_inherits, it can pull that flag out for free along with the
relation OID, and qsort() first by the flag and then by the OID. It
can also return the number of initial elements of its return value
which have that flag set.
Then, in find_all_inheritors, we split rels_list into
partitioned_rels_list and other_rels_list, and process
partitioned_rels_list in its entirety before touching other_rels_list;
they get concatenated at the end.
Now, find_all_inheritors and find_inheritance_children can also grow a
flag bool only_partitioned_children; if set, then we skip the
unpartitioned children entirely.
With all that in place, you can call find_all_inheritors(blah blah,
false) to lock the whole hierarchy, or find_all_inheritors(blah blah,
true) to lock just the partitioned tables in the hierarchy. You get a
consistent lock order either way, and if you start with only the
partitioned tables and later want the leaf partitions too, you just go
through the partitioned children in the order they were returned and
find_inheritance_children(blah blah, false) on each one of them and
the lock order is exactly consistent with what you would have gotten
if you'd done find_all_inheritors(blah blah, false) originally.
Thoughts?
P.S. While I haven't reviewed 0002 in detail, I think the concept of
minimizing what needs to be built in RelationGetPartitionDispatchInfo
is a very good idea.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 4 August 2017 at 22:55, Robert Haas <robertmhaas@gmail.com> wrote:
1. Before calling RelationGetPartitionDispatchInfo, the calling code
should use find_all_inheritors to lock all the relevant relations (or
the planner could use find_all_inheritors to get a list of relation
OIDs, store it in the plan in order, and then at execution time we
visit them in that order and lock them).2. RelationGetPartitionDispatchInfo assumes the relations are already locked.
I agree. I think overall, we should keep
RelationGetPartitionDispatchInfo() only for preparing the dispatch
info in the planner, and generate the locked oids (using
find_all_inheritors() or get_partitioned_oids() or whatever) *without*
using RelationGetPartitionDispatchInfo(), since
RelationGetPartitionDispatchInfo() is generating the pd structure
which we don't want in every expansion.
3. While we're optimizing, in the first loop inside of
RelationGetPartitionDispatchInfo, don't call heap_open(). Instead,
use get_rel_relkind() to see whether we've got a partitioned table; if
so, open it. If not, there's no need.
Yes, this way we need to open only the partitioned tables.
P.S. While I haven't reviewed 0002 in detail, I think the concept of
minimizing what needs to be built in RelationGetPartitionDispatchInfo
is a very good idea.
True. I also think, RelationGetPartitionDispatchInfo () should be
called while preparing the ModifyTable plan; the PartitionDispatch
data structure returned by RelationGetPartitionDispatchInfo() should
be stored in that plan, and then the execution-time fields in
PartitionDispatch would be populated in ExecInitModifyTable().
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2017/08/04 20:28, Ashutosh Bapat wrote:
On Fri, Aug 4, 2017 at 1:08 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:The current way to expand inherited tables, including partitioned tables,
is to use either find_all_inheritors() or find_inheritance_children()
depending on the context. They return child table OIDs in the (ascending)
order of those OIDs, which means the callers that need to lock the child
tables can do so without worrying about the possibility of deadlock in
some concurrent execution of that piece of code. That's good.For partitioned tables, there is a possibility of returning child table
(partition) OIDs in the partition bound order, which in addition to
preventing the possibility of deadlocks during concurrent locking, seems
potentially useful for other caller-specific optimizations. For example,
tuple-routing code can utilize that fact to implement binary-search based
partition-searching algorithm. For one more example, refer to the "UPDATE
partition key" thread where it's becoming clear that it would be nice if
the planner had put the partitions in bound order in the ModifyTable that
it creates for UPDATE of partitioned tables [1].Thanks a lot for working on this. Partition-wise join can benefit from
this as well. See comment about build_simple_rel's matching algorithm
in [1]. It will become O(n) instead of O(n^2).
Nice. It seems that we have a good demand for $subject. :)
So attached are two WIP patches:
0001 implements two interface functions:
List *get_all_partition_oids(Oid, LOCKMODE)
List *get_partition_oids(Oid, LOCKMODE)that resemble find_all_inheritors() and find_inheritance_children(),
respectively, but expect that users call them only for partitioned tables.
Needless to mention, OIDs are returned with canonical order determined by
that of the partition bounds and they way partition tree structure is
traversed (top-down, breadth-first-left-to-right). Patch replaces all the
calls of the old interface functions with the respective new ones for
partitioned table parents. That means expand_inherited_rtentry (among
others) now calls get_all_partition_oids() if the RTE is for a partitioned
table and find_all_inheritors() otherwise.In its implementation, get_all_partition_oids() calls
RelationGetPartitionDispatchInfo(), which is useful to generate the result
list in the desired partition bound order. But the current interface and
implementation of RelationGetPartitionDispatchInfo() needs some rework,
because it's too closely coupled with the executor's tuple routing code.May be we want to implement get_all_partition_oids() calling
get_partition_oids() on each new entry that gets added, similar to
find_all_inheritors(). That might avoid changes to DispatchInfo() and
also dependency on that structure.Also almost every place which called find_all_inheritors() or
find_inheritance_children() is changed to if () else case calling
those functions or the new function as required. May be we should
create macros/functions to do that routing to keep the code readable.
May be find_all_inheritors() and find_inheritance_children()
themselves become the routing function and their original code moves
into new functions get_all_inheritors() and
get_inheritance_children(). We may choose other names for functions.
The idea is to create routing functions/macros instead of sprinkling
code with if () else conditions.
Given the Robert's comments [1]/messages/by-id/CA+Tgmobwbh12OJerqAGyPEjb_+2y7T0nqRKTcjed6L4NTET6Fg@mail.gmail.com, it seems that we might have to abandon
the idea to separate the interface for partitioned and non-partitioned
inheritance cases. I'm thinking about the issues and alternatives he
mentioned in his email.
Thanks,
Amit
[1]: /messages/by-id/CA+Tgmobwbh12OJerqAGyPEjb_+2y7T0nqRKTcjed6L4NTET6Fg@mail.gmail.com
/messages/by-id/CA+Tgmobwbh12OJerqAGyPEjb_+2y7T0nqRKTcjed6L4NTET6Fg@mail.gmail.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Aug 4, 2017 at 10:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Aug 4, 2017 at 3:38 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:The current way to expand inherited tables, including partitioned tables,
is to use either find_all_inheritors() or find_inheritance_children()
depending on the context. They return child table OIDs in the (ascending)
order of those OIDs, which means the callers that need to lock the child
tables can do so without worrying about the possibility of deadlock in
some concurrent execution of that piece of code. That's good.For partitioned tables, there is a possibility of returning child table
(partition) OIDs in the partition bound order, which in addition to
preventing the possibility of deadlocks during concurrent locking, seems
potentially useful for other caller-specific optimizations. For example,
tuple-routing code can utilize that fact to implement binary-search based
partition-searching algorithm. For one more example, refer to the "UPDATE
partition key" thread where it's becoming clear that it would be nice if
the planner had put the partitions in bound order in the ModifyTable that
it creates for UPDATE of partitioned tables [1].I guess I don't really understand why we want to change the locking
order. That is bound to make expanding the inheritance hierarchy more
expensive. If we use this approach in all cases, it seems to me we're
bound to reintroduce the problem we fixed in commit
c1e0e7e1d790bf18c913e6a452dea811e858b554 and maybe add a few more in
the same vein.
I initially didn't understand this, but I think now I understand it.
Establishing the order of children by partition bounds requires
building the relcache entry right now. That's what is expensive and
would introduce the same problems as the commit you have quoted.
But I don't see that there's any necessary relation
between the order of locking and the order of expansion inside the
relcache entry/plan/whatever else -- so I propose that we keep the
existing locking order and only change the other stuff.While reading related code this morning, I noticed that
RelationBuildPartitionDesc and RelationGetPartitionDispatchInfo have
*already* changed the locking order for certain operations, because
the PartitionDesc's OID array is bound-ordered not OID-ordered. That
means that when RelationGetPartitionDispatchInfo uses the
PartitionDesc's OID arra to figure out what to lock, it's potentially
going to encounter partitions in a different order than would have
been the case if it had used find_all_inheritors directly. I'm
tempted to think that RelationGetPartitionDispatchInfo shouldn't
really be doing locking at all. The best way to have the locking
always happen in the same order is to have only one piece of code that
determines that order - and I vote for find_all_inheritors. Aside
from the fact that it's the way we've always done it (and still do it
in most other places), that code includes infinite-loop defenses which
the new code you've introduced lacks.
+1.
Concretely, my proposal is:
1. Before calling RelationGetPartitionDispatchInfo, the calling code
should use find_all_inheritors to lock all the relevant relations (or
the planner could use find_all_inheritors to get a list of relation
OIDs, store it in the plan in order, and then at execution time we
visit them in that order and lock them).2. RelationGetPartitionDispatchInfo assumes the relations are already locked.
3. While we're optimizing, in the first loop inside of
RelationGetPartitionDispatchInfo, don't call heap_open(). Instead,
use get_rel_relkind() to see whether we've got a partitioned table; if
so, open it. If not, there's no need.4. For safety, add a function bool RelationLockHeldByMe(Oid) and add
to this loop a check if (!RelationLockHeldByMe(lfirst_oid(lc1))
elog(ERROR, ...). Might be interesting to stuff that check into the
relation_open(..., NoLock) path, too.One objection to this line of attack is that there might be a good
case for locking only the partitioned inheritors first and then going
back and locking the leaf nodes in a second pass, or even only when
required for a particular row. However, that doesn't require putting
everything in bound order - it only requires moving the partitioned
children to the beginning of the list. And I think rather than having
new logic for that, we should teach find_inheritance_children() to do
that directly. I have a feeling Ashutosh is going to cringe at this
suggestion, but my idea is to do this by denormalizing: add a column
to pg_inherits indicating whether the child is of
RELKIND_PARTITIONED_TABLE. Then, when find_inheritance_children scans
pg_inherits, it can pull that flag out for free along with the
relation OID, and qsort() first by the flag and then by the OID. It
can also return the number of initial elements of its return value
which have that flag set.
I am always uncomfortable, when we save the same information in two
places without having a way to make sure that they are in sync. That
means we have to add explicit code to make sure that that information
is kept in sync. Somebody forgetting to add that code wherever
necessary means we have contradictory information persisted in the
databases without an idea of which of them is correct. Not necessarily
in this case, but usually it is an indication of something going wrong
with the way schema is designed. May be it's better to use your idea
of using get_rel_relkind() or find a way to check that the flag is in
sync with the relkind, like when building the relcache.
Then, in find_all_inheritors, we split rels_list into
partitioned_rels_list and other_rels_list, and process
partitioned_rels_list in its entirety before touching other_rels_list;
they get concatenated at the end.Now, find_all_inheritors and find_inheritance_children can also grow a
flag bool only_partitioned_children; if set, then we skip the
unpartitioned children entirely.With all that in place, you can call find_all_inheritors(blah blah,
false) to lock the whole hierarchy, or find_all_inheritors(blah blah,
true) to lock just the partitioned tables in the hierarchy. You get a
consistent lock order either way, and if you start with only the
partitioned tables and later want the leaf partitions too, you just go
through the partitioned children in the order they were returned and
find_inheritance_children(blah blah, false) on each one of them and
the lock order is exactly consistent with what you would have gotten
if you'd done find_all_inheritors(blah blah, false) originally.Thoughts?
I noticed that find_all_inheritors() uses a hash table to eliminate
duplicates arising out of multiple inheritance. Partition hierarchy is
never going to have multiple inheritance, and doesn't need to
eliminate duplicates and so doesn't need the hash table. It will be
good, if we can eliminate that overhead. But that's separate task than
what this thread is about.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Aug 7, 2017 at 11:18 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
One objection to this line of attack is that there might be a good
case for locking only the partitioned inheritors first and then going
back and locking the leaf nodes in a second pass, or even only when
required for a particular row. However, that doesn't require putting
everything in bound order - it only requires moving the partitioned
children to the beginning of the list. And I think rather than having
new logic for that, we should teach find_inheritance_children() to do
that directly. I have a feeling Ashutosh is going to cringe at this
suggestion, but my idea is to do this by denormalizing: add a column
to pg_inherits indicating whether the child is of
RELKIND_PARTITIONED_TABLE. Then, when find_inheritance_children scans
pg_inherits, it can pull that flag out for free along with the
relation OID, and qsort() first by the flag and then by the OID. It
can also return the number of initial elements of its return value
which have that flag set.I am always uncomfortable, when we save the same information in two
places without having a way to make sure that they are in sync. That
means we have to add explicit code to make sure that that information
is kept in sync. Somebody forgetting to add that code wherever
necessary means we have contradictory information persisted in the
databases without an idea of which of them is correct. Not necessarily
in this case, but usually it is an indication of something going wrong
with the way schema is designed. May be it's better to use your idea
of using get_rel_relkind() or find a way to check that the flag is in
sync with the relkind, like when building the relcache.
Said all that, I think we will use this code quite often and so the
performance benefits by replicating the information are worth the
trouble of maintaining code to sync and check the duplicate
information.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2017/08/05 2:25, Robert Haas wrote:
On Fri, Aug 4, 2017 at 3:38 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:The current way to expand inherited tables, including partitioned tables,
is to use either find_all_inheritors() or find_inheritance_children()
depending on the context. They return child table OIDs in the (ascending)
order of those OIDs, which means the callers that need to lock the child
tables can do so without worrying about the possibility of deadlock in
some concurrent execution of that piece of code. That's good.For partitioned tables, there is a possibility of returning child table
(partition) OIDs in the partition bound order, which in addition to
preventing the possibility of deadlocks during concurrent locking, seems
potentially useful for other caller-specific optimizations. For example,
tuple-routing code can utilize that fact to implement binary-search based
partition-searching algorithm. For one more example, refer to the "UPDATE
partition key" thread where it's becoming clear that it would be nice if
the planner had put the partitions in bound order in the ModifyTable that
it creates for UPDATE of partitioned tables [1].I guess I don't really understand why we want to change the locking
order. That is bound to make expanding the inheritance hierarchy more
expensive. If we use this approach in all cases, it seems to me we're
bound to reintroduce the problem we fixed in commit
c1e0e7e1d790bf18c913e6a452dea811e858b554 and maybe add a few more in
the same vein. But I don't see that there's any necessary relation
between the order of locking and the order of expansion inside the
relcache entry/plan/whatever else -- so I propose that we keep the
existing locking order and only change the other stuff.
Hmm, okay.
I guess I was trying to fit one solution to what might be two (or worse,
more) problems of the current implementation, which is not good.
While reading related code this morning, I noticed that
RelationBuildPartitionDesc and RelationGetPartitionDispatchInfo have
*already* changed the locking order for certain operations, because
the PartitionDesc's OID array is bound-ordered not OID-ordered. That
means that when RelationGetPartitionDispatchInfo uses the
PartitionDesc's OID arra to figure out what to lock, it's potentially
going to encounter partitions in a different order than would have
been the case if it had used find_all_inheritors directly.
I think Amit Khandekar mentioned this on the UPDATE partition key thread [1]/messages/by-id/CAJ3gD9fdjk2O8aPMXidCeYeB-mFB=wY9ZLfe8cQOfG4bTqVGyQ@mail.gmail.com.
I'm
tempted to think that RelationGetPartitionDispatchInfo shouldn't
really be doing locking at all. The best way to have the locking
always happen in the same order is to have only one piece of code that
determines that order - and I vote for find_all_inheritors. Aside
from the fact that it's the way we've always done it (and still do it
in most other places), that code includes infinite-loop defenses which
the new code you've introduced lacks.
As long as find_all_inheritors() is a place only to determine the order in
which partitions will be locked, it's fine. My concern is about the time
of actual locking, which in the current planner implementation is too soon
that we end up needlessly locking all the partitions.
(Also in the current implementation, we open all the partitions to
construct Var translation lists, which are actually unused through most of
the planner stages, but admittedly it's a separate issue.)
The locking-partitions-too-soon issue, I think, is an important one and
may need discussing separately, but thought I'd mention it anyway. It
also seems somewhat related to this discussion, but I may be wrong.
Concretely, my proposal is:
1. Before calling RelationGetPartitionDispatchInfo, the calling code
should use find_all_inheritors to lock all the relevant relations (or
the planner could use find_all_inheritors to get a list of relation
OIDs, store it in the plan in order, and then at execution time we
visit them in that order and lock them).
ISTM, we'd want to lock the partitions after we've determined the specific
ones a query needs to scan using the information returned by
RelationGetPartitionDispatchInfo. That means the latter had better locked
the relations whose cached partition descriptors will be used to determine
the result that it produces. One way to do that might be to lock all the
tables in the list returned by find_all_inheritors that are partitioned
tables before calling RelationGetPartitionDispatchInfo. It seems what the
approach you've outlined below will let us do that.
BTW, IIUC, there will be two lists of OIDs we'll have: one in the
find_all_inheritors order, say, L1 and the other determined by using
partitioning-specific information for the given query, say L2.
To lock, we iterate L1 and if a given member is in L2, we lock it. It
might be possible to make it as cheap as O(nlogn).
L2 is the order we put leaf partitions into a given plan.
2. RelationGetPartitionDispatchInfo assumes the relations are already locked.
3. While we're optimizing, in the first loop inside of
RelationGetPartitionDispatchInfo, don't call heap_open(). Instead,
use get_rel_relkind() to see whether we've got a partitioned table; if
so, open it. If not, there's no need.
That's what the proposed refactoring patch 0002 actually does.
4. For safety, add a function bool RelationLockHeldByMe(Oid) and add
to this loop a check if (!RelationLockHeldByMe(lfirst_oid(lc1))
elog(ERROR, ...). Might be interesting to stuff that check into the
relation_open(..., NoLock) path, too.One objection to this line of attack is that there might be a good
case for locking only the partitioned inheritors first and then going
back and locking the leaf nodes in a second pass, or even only when
required for a particular row. However, that doesn't require putting
everything in bound order - it only requires moving the partitioned
children to the beginning of the list. And I think rather than having
new logic for that, we should teach find_inheritance_children() to do
that directly. I have a feeling Ashutosh is going to cringe at this
suggestion, but my idea is to do this by denormalizing: add a column
to pg_inherits indicating whether the child is of
RELKIND_PARTITIONED_TABLE. Then, when find_inheritance_children scans
pg_inherits, it can pull that flag out for free along with the
relation OID, and qsort() first by the flag and then by the OID. It
can also return the number of initial elements of its return value
which have that flag set.
Maybe, we can make the initial patch use syscache to get the relkind for a
given child. If the syscache bloat is unbearable, we go with the
denormalization approach.
Then, in find_all_inheritors, we split rels_list into
partitioned_rels_list and other_rels_list, and process
partitioned_rels_list in its entirety before touching other_rels_list;
they get concatenated at the end.Now, find_all_inheritors and find_inheritance_children can also grow a
flag bool only_partitioned_children; if set, then we skip the
unpartitioned children entirely.With all that in place, you can call find_all_inheritors(blah blah,
false) to lock the whole hierarchy, or find_all_inheritors(blah blah,
true) to lock just the partitioned tables in the hierarchy. You get a
consistent lock order either way, and if you start with only the
partitioned tables and later want the leaf partitions too, you just go
through the partitioned children in the order they were returned and
find_inheritance_children(blah blah, false) on each one of them and
the lock order is exactly consistent with what you would have gotten
if you'd done find_all_inheritors(blah blah, false) originally.Thoughts?
So, with this in place:
1. Call find_all_inheritors to lock partitioned tables in the tree in an
order prescribed by OIDs
2. Call RelationGetPartitionDispatchInfo at an opportune time, which will
generate minimal information about the partition tree that it can do
without having to worry about locking anything
3. Determine the list of which leaf partitions will need to be scanned
using the information obtained in 2, if possible to do that at all [2]We can do that in set_append_rel_size(), but not in inheritance_planner().
4. Lock leaf partitions in the find_inheritance_children prescribed order,
but only those that are in the list built in 3.
P.S. While I haven't reviewed 0002 in detail, I think the concept of
minimizing what needs to be built in RelationGetPartitionDispatchInfo
is a very good idea.
Thanks.
Regards,
Amit
[1]: /messages/by-id/CAJ3gD9fdjk2O8aPMXidCeYeB-mFB=wY9ZLfe8cQOfG4bTqVGyQ@mail.gmail.com
/messages/by-id/CAJ3gD9fdjk2O8aPMXidCeYeB-mFB=wY9ZLfe8cQOfG4bTqVGyQ@mail.gmail.com
[2]: We can do that in set_append_rel_size(), but not in inheritance_planner()
inheritance_planner()
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Aug 7, 2017 at 1:48 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
with the way schema is designed. May be it's better to use your idea
of using get_rel_relkind() or find a way to check that the flag is in
sync with the relkind, like when building the relcache.
That's got the same problem as building a full relcache entry: cache
bloat. It will have *less* cache bloat, but still some. Maybe it's
little enough to be tolerable; not sure. But we want this system to
scale to LOTS of partitions eventually, so building on a design that
we know has scaling problems seems a bit short-sighted.
I noticed that find_all_inheritors() uses a hash table to eliminate
duplicates arising out of multiple inheritance. Partition hierarchy is
never going to have multiple inheritance, and doesn't need to
eliminate duplicates and so doesn't need the hash table. It will be
good, if we can eliminate that overhead. But that's separate task than
what this thread is about.
I don't want to eliminate that overhead. If the catalog is manually
modified or corrupted, the problem could still occur, and result in
backend crashes or, at best, incomprehensible errors. The comments
allude to this problem.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Aug 7, 2017 at 2:54 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
I think Amit Khandekar mentioned this on the UPDATE partition key thread [1].
Yes, similar discussion.
As long as find_all_inheritors() is a place only to determine the order in
which partitions will be locked, it's fine. My concern is about the time
of actual locking, which in the current planner implementation is too soon
that we end up needlessly locking all the partitions.
I don't think avoiding that problem is going to be easy. We need a
bunch of per-relation information, like the size of each relation, and
what indexes it has, and how big they are, and the statistics for each
one. It was at one point proposed by someone that every partition
should be required to have the same indexes, but (1) we didn't
implement it like that and (2) if we had done that it wouldn't solve
this problem anyway because the sizes are still going to vary.
Note that I'm not saying this isn't a good problem to solve, just that
it's likely to be a very hard problem to solve.
The locking-partitions-too-soon issue, I think, is an important one and
ISTM, we'd want to lock the partitions after we've determined the specific
ones a query needs to scan using the information returned by
RelationGetPartitionDispatchInfo. That means the latter had better locked
the relations whose cached partition descriptors will be used to determine
the result that it produces. One way to do that might be to lock all the
tables in the list returned by find_all_inheritors that are partitioned
tables before calling RelationGetPartitionDispatchInfo. It seems what the
approach you've outlined below will let us do that.
Yeah, I think so. I think we could possibly open and lock partitioned
children only, then prune away leaf partitions that we can determine
aren't needed, then open and lock the leaf partitions that are needed.
BTW, IIUC, there will be two lists of OIDs we'll have: one in the
find_all_inheritors order, say, L1 and the other determined by using
partitioning-specific information for the given query, say L2.To lock, we iterate L1 and if a given member is in L2, we lock it. It
might be possible to make it as cheap as O(nlogn).
Commonly, we'll prune no partitions or all but one; and we should be
able to make those cases very fast. Other cases can cost a little
more, but I'll certainly complain about anything more than O(n lg n).
3. While we're optimizing, in the first loop inside of
RelationGetPartitionDispatchInfo, don't call heap_open(). Instead,
use get_rel_relkind() to see whether we've got a partitioned table; if
so, open it. If not, there's no need.That's what the proposed refactoring patch 0002 actually does.
Great.
Maybe, we can make the initial patch use syscache to get the relkind for a
given child. If the syscache bloat is unbearable, we go with the
denormalization approach.
Yeah. Maybe if you write that patch, you can also test it to see how
bad the bloat is.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2017/08/07 14:37, Amit Khandekar wrote:
On 4 August 2017 at 22:55, Robert Haas <robertmhaas@gmail.com> wrote:
P.S. While I haven't reviewed 0002 in detail, I think the concept of
minimizing what needs to be built in RelationGetPartitionDispatchInfo
is a very good idea.True. I also think, RelationGetPartitionDispatchInfo () should be
called while preparing the ModifyTable plan; the PartitionDispatch
data structure returned by RelationGetPartitionDispatchInfo() should
be stored in that plan, and then the execution-time fields in
PartitionDispatch would be populated in ExecInitModifyTable().
I'm not sure if we could ever store the PartitionDispatch structure itself
in the plan.
Planner would build and use it to put the leaf partition sub-plans in the
canonical order in the resulting plan (Append, ModifyTable, etc.).
Executor will have to rebuild the PartitionDispatch info again if and when
it needs the same (for example, in ExecSetupPartitionTupleRouting for
insert or update tuple routing).
The refactoring patch that I've proposed (0002) makes PartitionDispatch
structure itself contain a lot less information/state than it currently
does. So RelationGetPartitionDispatchInfo's job per the revised patch is
to reveal the partition tree structure and the information of each
partitioned table that the tree contains. The original design whereby it
builds and puts into PartitionDispatchData thing like partition key
execution state (ExprState), TupleTableSlot, TupleConversionMap seems
wrong to me in retrospect and we should somehow revise it. Those things I
mentioned are only needed for tuple-routing, so they should be built and
managed by the executor, not partition.c. Any feedback on the proposed
patch is welcome. :)
Thanks,
Amit
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2017/08/08 4:34, Robert Haas wrote:
On Mon, Aug 7, 2017 at 2:54 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:As long as find_all_inheritors() is a place only to determine the order in
which partitions will be locked, it's fine. My concern is about the time
of actual locking, which in the current planner implementation is too soon
that we end up needlessly locking all the partitions.I don't think avoiding that problem is going to be easy. We need a
bunch of per-relation information, like the size of each relation, and
what indexes it has, and how big they are, and the statistics for each
one. It was at one point proposed by someone that every partition
should be required to have the same indexes, but (1) we didn't
implement it like that and (2) if we had done that it wouldn't solve
this problem anyway because the sizes are still going to vary.
Sorry, I didn't mean to say we shouldn't lock and open partitions at all.
We do need their relation descriptors for planning and there is no doubt
about that. I was just saying that we should do that only for the
partitions that are not pruned. But, as you say, I can see that the
planner changes required to be able to do that might be hard.
The locking-partitions-too-soon issue, I think, is an important one and
ISTM, we'd want to lock the partitions after we've determined the specific
ones a query needs to scan using the information returned by
RelationGetPartitionDispatchInfo. That means the latter had better locked
the relations whose cached partition descriptors will be used to determine
the result that it produces. One way to do that might be to lock all the
tables in the list returned by find_all_inheritors that are partitioned
tables before calling RelationGetPartitionDispatchInfo. It seems what the
approach you've outlined below will let us do that.Yeah, I think so. I think we could possibly open and lock partitioned
children only, then prune away leaf partitions that we can determine
aren't needed, then open and lock the leaf partitions that are needed.
Yes.
BTW, IIUC, there will be two lists of OIDs we'll have: one in the
find_all_inheritors order, say, L1 and the other determined by using
partitioning-specific information for the given query, say L2.To lock, we iterate L1 and if a given member is in L2, we lock it. It
might be possible to make it as cheap as O(nlogn).Commonly, we'll prune no partitions or all but one; and we should be
able to make those cases very fast.
Agreed.
Maybe, we can make the initial patch use syscache to get the relkind for a
given child. If the syscache bloat is unbearable, we go with the
denormalization approach.Yeah. Maybe if you write that patch, you can also test it to see how
bad the bloat is.
I will try and see, but maybe the syscache solution doesn't get us past
the proof-of-concept stage.
Thanks,
Amit
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2017/08/05 2:25, Robert Haas wrote:
Concretely, my proposal is:
1. Before calling RelationGetPartitionDispatchInfo, the calling code
should use find_all_inheritors to lock all the relevant relations (or
the planner could use find_all_inheritors to get a list of relation
OIDs, store it in the plan in order, and then at execution time we
visit them in that order and lock them).2. RelationGetPartitionDispatchInfo assumes the relations are already locked.
3. While we're optimizing, in the first loop inside of
RelationGetPartitionDispatchInfo, don't call heap_open(). Instead,
use get_rel_relkind() to see whether we've got a partitioned table; if
so, open it. If not, there's no need.4. For safety, add a function bool RelationLockHeldByMe(Oid) and add
to this loop a check if (!RelationLockHeldByMe(lfirst_oid(lc1))
elog(ERROR, ...). Might be interesting to stuff that check into the
relation_open(..., NoLock) path, too.One objection to this line of attack is that there might be a good
case for locking only the partitioned inheritors first and then going
back and locking the leaf nodes in a second pass, or even only when
required for a particular row. However, that doesn't require putting
everything in bound order - it only requires moving the partitioned
children to the beginning of the list. And I think rather than having
new logic for that, we should teach find_inheritance_children() to do
that directly. I have a feeling Ashutosh is going to cringe at this
suggestion, but my idea is to do this by denormalizing: add a column
to pg_inherits indicating whether the child is of
RELKIND_PARTITIONED_TABLE. Then, when find_inheritance_children scans
pg_inherits, it can pull that flag out for free along with the
relation OID, and qsort() first by the flag and then by the OID. It
can also return the number of initial elements of its return value
which have that flag set.Then, in find_all_inheritors, we split rels_list into
partitioned_rels_list and other_rels_list, and process
partitioned_rels_list in its entirety before touching other_rels_list;
they get concatenated at the end.Now, find_all_inheritors and find_inheritance_children can also grow a
flag bool only_partitioned_children; if set, then we skip the
unpartitioned children entirely.With all that in place, you can call find_all_inheritors(blah blah,
false) to lock the whole hierarchy, or find_all_inheritors(blah blah,
true) to lock just the partitioned tables in the hierarchy. You get a
consistent lock order either way, and if you start with only the
partitioned tables and later want the leaf partitions too, you just go
through the partitioned children in the order they were returned and
find_inheritance_children(blah blah, false) on each one of them and
the lock order is exactly consistent with what you would have gotten
if you'd done find_all_inheritors(blah blah, false) originally.
I tried implementing this in the attached set of patches.
[PATCH 2/5] Teach pg_inherits.c a bit about partitioning
Both find_inheritance_children and find_all_inheritors now list
partitioned child tables before non-partitioned ones and return
the number of partitioned tables in an optional output argument
[PATCH 3/5] Relieve RelationGetPartitionDispatchInfo() of doing locking
Anyone who wants to call RelationGetPartitionDispatchInfo() must first
acquire locks using find_all_inheritors.
TODO: Add RelationLockHeldByMe() and put if (!RelationLockHeldByMe())
elog(ERROR, ...) check in RelationGetPartitionDispatchInfo()
[PATCH 4/5] Teach expand_inherited_rtentry to use partition bound order
After locking the child tables using find_all_inheritors, we discard
the list of child table OIDs that it generates and rebuild the same
using the information returned by RelationGetPartitionDispatchInfo.
[PATCH 5/5] Store in pg_inherits if a child is a partitioned table
Catalog changes so that is_partitioned property of child tables is now
stored in pg_inherits. This avoids consulting syscache to get that
property as is currently implemented in patch 2/5.
I haven't yet done anything about changing the timing of opening and
locking leaf partitions, because it will require some more thinking about
the required planner changes. But the above set of patches will get us
far enough to get leaf partition sub-plans appear in the partition bound
order (same order as what partition tuple-routing uses in the executor).
With the above patches, we get the desired order of child sub-plans in
Append and ModifyTable plans for partitioned tables:
create table p (a int) partition by range (a);
create table p4 partition of p for values from (30) to (40);
create table p3 partition of p for values from (20) to (30);
create table p2 partition of p for values from (10) to (20);
create table p1 partition of p for values from (1) to (10);
create table p0 partition of p for values from (minvalue) to (1) partition
by list (a);
create table p00 partition of p0 for values in (0);
create table p01 partition of p0 for values in (-1);
create table p02 partition of p0 for values in (-2);
explain select count(*) from p;
QUERY PLAN
-------------------------------------------------------------------
Aggregate (cost=293.12..293.13 rows=1 width=8)
-> Append (cost=0.00..248.50 rows=17850 width=0)
-> Seq Scan on p1 (cost=0.00..35.50 rows=2550 width=0)
-> Seq Scan on p2 (cost=0.00..35.50 rows=2550 width=0)
-> Seq Scan on p3 (cost=0.00..35.50 rows=2550 width=0)
-> Seq Scan on p4 (cost=0.00..35.50 rows=2550 width=0)
-> Seq Scan on p02 (cost=0.00..35.50 rows=2550 width=0)
-> Seq Scan on p01 (cost=0.00..35.50 rows=2550 width=0)
-> Seq Scan on p00 (cost=0.00..35.50 rows=2550 width=0)
explain update p set a = a;
QUERY PLAN
--------------------------------------------------------------
Update on p (cost=0.00..248.50 rows=17850 width=10)
Update on p1
Update on p2
Update on p3
Update on p4
Update on p02
Update on p01
Update on p00
-> Seq Scan on p1 (cost=0.00..35.50 rows=2550 width=10)
-> Seq Scan on p2 (cost=0.00..35.50 rows=2550 width=10)
-> Seq Scan on p3 (cost=0.00..35.50 rows=2550 width=10)
-> Seq Scan on p4 (cost=0.00..35.50 rows=2550 width=10)
-> Seq Scan on p02 (cost=0.00..35.50 rows=2550 width=10)
-> Seq Scan on p01 (cost=0.00..35.50 rows=2550 width=10)
-> Seq Scan on p00 (cost=0.00..35.50 rows=2550 width=10)
(15 rows)
P.S. While I haven't reviewed 0002 in detail, I think the concept of
minimizing what needs to be built in RelationGetPartitionDispatchInfo
is a very good idea.
I put this patch ahead in the list and so it's now 0001.
Thanks,
Amit
Attachments:
0001-Decouple-RelationGetPartitionDispatchInfo-from-execu.patchtext/plain; charset=UTF-8; name=0001-Decouple-RelationGetPartitionDispatchInfo-from-execu.patchDownload+409-248
0002-Teach-pg_inherits.c-a-bit-about-partitioning.patchtext/plain; charset=UTF-8; name=0002-Teach-pg_inherits.c-a-bit-about-partitioning.patchDownload+162-56
0003-Relieve-RelationGetPartitionDispatchInfo-of-doing-an.patchtext/plain; charset=UTF-8; name=0003-Relieve-RelationGetPartitionDispatchInfo-of-doing-an.patchDownload+33-34
0004-Teach-expand_inherited_rtentry-to-use-partition-boun.patchtext/plain; charset=UTF-8; name=0004-Teach-expand_inherited_rtentry-to-use-partition-boun.patchDownload+32-1
0005-Store-in-pg_inherits-if-a-child-is-a-partitioned-tab.patchtext/plain; charset=UTF-8; name=0005-Store-in-pg_inherits-if-a-child-is-a-partitioned-tab.patchDownload+31-15
Hi Amit,
On Thu, Aug 10, 2017 at 7:41 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2017/08/05 2:25, Robert Haas wrote:
Concretely, my proposal is:
P.S. While I haven't reviewed 0002 in detail, I think the concept of
minimizing what needs to be built in RelationGetPartitionDispatchInfo
is a very good idea.I put this patch ahead in the list and so it's now 0001.
FYI, 0001 patch throws the warning:
execMain.c: In function ‘ExecSetupPartitionTupleRouting’:
execMain.c:3342:16: warning: ‘next_ptinfo’ may be used uninitialized
in this function [-Wmaybe-uninitialized]
next_ptinfo->parentid != ptinfo->parentid)
--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Aug 9, 2017 at 10:11 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
P.S. While I haven't reviewed 0002 in detail, I think the concept of
minimizing what needs to be built in RelationGetPartitionDispatchInfo
is a very good idea.I put this patch ahead in the list and so it's now 0001.
I think what you've currently got as
0003-Relieve-RelationGetPartitionDispatchInfo-of-doing-an.patch is a
bug fix that probably needs to be back-patched into v10, so it should
come first.
I think 0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch and
0005-Store-in-pg_inherits-if-a-child-is-a-partitioned-tab.patch should
be merged into one patch and that should come next, followed by
0004-Teach-expand_inherited_rtentry-to-use-partition-boun.patch and
finally what you now have as
0001-Decouple-RelationGetPartitionDispatchInfo-from-execu.patch.
This patch series is blocking a bunch of other things, so it would be
nice if you could press forward with this quickly.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2017/08/10 18:52, Beena Emerson wrote:
Hi Amit,
On Thu, Aug 10, 2017 at 7:41 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:On 2017/08/05 2:25, Robert Haas wrote:
Concretely, my proposal is:
P.S. While I haven't reviewed 0002 in detail, I think the concept of
minimizing what needs to be built in RelationGetPartitionDispatchInfo
is a very good idea.I put this patch ahead in the list and so it's now 0001.
FYI, 0001 patch throws the warning:
execMain.c: In function ‘ExecSetupPartitionTupleRouting’:
execMain.c:3342:16: warning: ‘next_ptinfo’ may be used uninitialized
in this function [-Wmaybe-uninitialized]
next_ptinfo->parentid != ptinfo->parentid)
Thanks for the review. Will fix in the updated version of the patch I
will post sometime later today.
Regards,
Amit
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Thanks for the review.
On 2017/08/16 2:27, Robert Haas wrote:
On Wed, Aug 9, 2017 at 10:11 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:P.S. While I haven't reviewed 0002 in detail, I think the concept of
minimizing what needs to be built in RelationGetPartitionDispatchInfo
is a very good idea.I put this patch ahead in the list and so it's now 0001.
I think what you've currently got as
0003-Relieve-RelationGetPartitionDispatchInfo-of-doing-an.patch is a
bug fix that probably needs to be back-patched into v10, so it should
come first.
That makes sense. That patch is now 0001. Checked that it can be
back-patched to REL_10_STABLE.
I think 0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch and
0005-Store-in-pg_inherits-if-a-child-is-a-partitioned-tab.patch should
be merged into one patch and that should come next,
Merged the two into one: attached 0002.
followed by
0004-Teach-expand_inherited_rtentry-to-use-partition-boun.patch and
This one is now 0003.
finally what you now have as
0001-Decouple-RelationGetPartitionDispatchInfo-from-execu.patch.
And 0004.
This patch series is blocking a bunch of other things, so it would be
nice if you could press forward with this quickly.
Attached updated patches.
Thanks,
Amit
Attachments:
0001-Relieve-RelationGetPartitionDispatchInfo-of-doing-an.patchtext/plain; charset=UTF-8; name=0001-Relieve-RelationGetPartitionDispatchInfo-of-doing-an.patchDownload+42-35
0002-Teach-pg_inherits.c-a-bit-about-partitioning.patchtext/plain; charset=UTF-8; name=0002-Teach-pg_inherits.c-a-bit-about-partitioning.patchDownload+187-64
0003-Teach-expand_inherited_rtentry-to-use-partition-boun.patchtext/plain; charset=UTF-8; name=0003-Teach-expand_inherited_rtentry-to-use-partition-boun.patchDownload+51-1
0004-Decouple-RelationGetPartitionDispatchInfo-from-execu.patchtext/plain; charset=UTF-8; name=0004-Decouple-RelationGetPartitionDispatchInfo-from-execu.patchDownload+399-262
On 16 August 2017 at 11:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Attached updated patches.
Thanks Amit for the patches.
I too agree with the overall approach taken for keeping the locking
order consistent: it's best to do the locking with the existing
find_all_inheritors() since it is much cheaper than to lock them in
partition-bound order, the later being expensive since it requires
opening partitioned tables.
I haven't yet done anything about changing the timing of opening and
locking leaf partitions, because it will require some more thinking about
the required planner changes. But the above set of patches will get us
far enough to get leaf partition sub-plans appear in the partition bound
order (same order as what partition tuple-routing uses in the executor).
So, I believe none of the changes done in pg_inherits.c are essential
for expanding the inheritence tree in bound order, right ? I am not
sure whether we are planning to commit these two things together or
incrementally :
1. Expand in bound order
2. Allow for locking only the partitioned tables first.
For #1, the changes in pg_inherits.c are not required (viz, keeping
partitioned tables at the head of the list, adding inhchildparted
column, etc).
If we are going to do #2 together with #1, then in the patch set there
is no one using the capability introduced by #2. That is, there are no
callers of find_all_inheritors() that make use of the new
num_partitioned_children parameter. Also, there is no boolean
parameter for find_all_inheritors() to be used to lock only the
partitioned tables.
I feel we should think about
0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch later, and
first get the review done for the other patches.
-------
I see that RelationGetPartitionDispatchInfo() now returns quite a
small subset of what it used to return, which is good. But I feel for
expand_inherited_rtentry(), RelationGetPartitionDispatchInfo() is
still a bit heavy. We only require the oids, so the
PartitionedTableInfo data structure (including the pd->indexes array)
gets discarded.
Also, RelationGetPartitionDispatchInfo() has to call get_rel_relkind()
for each child. In expand_inherited_rtentry(), we anyway have to open
all the child tables, so we get the partition descriptors for each of
the children for free. So how about, in expand_inherited_rtentry(), we
traverse the partition tree using these descriptors similar to how it
is traversed in RelationGetPartitionDispatchInfo() ? May be to avoid
code duplication for traversing, we can have a common API.
Still looking at RelationGetPartitionDispatchInfo() changes ...
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Aug 16, 2017 at 11:06 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
This patch series is blocking a bunch of other things, so it would be
nice if you could press forward with this quickly.Attached updated patches.
Review for 0001. The attached patch has some minor changes to the
comments and code.
+ * All the relations in the partition tree (including 'rel') must have been
+ * locked (using at least the AccessShareLock) by the caller.
It would be good if we can Assert this in the function. But I couldn't find a
way to check whether the backend holds a lock of required strength. Is there
any?
/*
- * We locked all the partitions above including the leaf partitions.
- * Note that each of the relations in *partitions are eventually
- * closed by the caller.
+ * All the partitions were locked above. Note that the relcache
+ * entries will be closed by ExecEndModifyTable().
*/
I don't see much value in this hunk, so removed it in the attached patch.
+ list_free(all_parts);
ExecSetupPartitionTupleRouting() will be called only once per DML statement.
Leaking the memory for the duration of DML may be worth the time spent
in the traversing
the list and freeing each cell independently. So removed the hunk in the
attached set.
0002 review
+
+ <row>
+ <entry><structfield>inhchildparted</structfield></entry>
+ <entry><type>bool</type></entry>
+ <entry></entry>
+ <entry>
+ This is <literal>true</> if the child table is a partitioned table,
+ <literal>false</> otherwise
+ </entry>
+ </row>
In the catalogs we are using full "partitioned" e.g. pg_partitioned_table. May
be we should name the column as "inhchildpartitioned".
+#define OID_CMP(o1, o2) \
+ ((o1) < (o2) ? -1 : ((o1) > (o2) ? 1 : 0));
Instead of duplicating the logic in this macro and oid_cmp(), we may want to
call this macro in oid_cmp()? Or simply call oid_cmp() from inhchildinfo_cmp()
passing pointers to the OIDs; a pointer indirection would be costly, but still
maintainable.
+ if (num_partitioned_children)
+ *num_partitioned_children = my_num_partitioned_children;
+
Instead of this conditional, why not to make every caller pass a pointer to
integer. The callers will just ignore the value if they don't want it. If we do
this change, we can get rid of my_num_partitioned_children variable and
directly update the passed in pointer variable.
inhrelid = ((Form_pg_inherits) GETSTRUCT(inheritsTuple))->inhrelid;
- if (numoids >= maxoids)
+ is_partitioned = ((Form_pg_inherits)
+ GETSTRUCT(inheritsTuple))->inhchildparted;
Now that we are fetching two members from Form_pg_inherits structure, may be we
should use a local variable
Form_pg_inherits inherits_tuple = GETSTRUCT(inheritsTuple);
and use that to fetch its members.
I am still reviewing changes in this patch.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
0001-Relieve-RelationGetPartitionDispatchInfo-of-doing-an.patchtext/x-patch; charset=US-ASCII; name=0001-Relieve-RelationGetPartitionDispatchInfo-of-doing-an.patchDownload+29-34
Hi Amit,
Thanks for the comments.
On 2017/08/16 20:30, Amit Khandekar wrote:
On 16 August 2017 at 11:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Attached updated patches.
Thanks Amit for the patches.
I too agree with the overall approach taken for keeping the locking
order consistent: it's best to do the locking with the existing
find_all_inheritors() since it is much cheaper than to lock them in
partition-bound order, the later being expensive since it requires
opening partitioned tables.
Yeah. Per the Robert's design idea, we will always do the *locking* in
the order determined by find_all_inheritors/find_inheritance_children.
I haven't yet done anything about changing the timing of opening and
locking leaf partitions, because it will require some more thinking about
the required planner changes. But the above set of patches will get us
far enough to get leaf partition sub-plans appear in the partition bound
order (same order as what partition tuple-routing uses in the executor).So, I believe none of the changes done in pg_inherits.c are essential
for expanding the inheritence tree in bound order, right ?
Right.
The changes to pg_inherits.c are only about recognizing partitioned tables
in an inheritance hierarchy and putting them ahead in the returned list.
Now that I think of it, the title of the patch (teach pg_inherits.c about
"partitioning") sounds a bit confusing. In particular, the patch does not
teach it things like partition bound order, just that some tables in the
hierarchy could be partitioned tables.
I am not
sure whether we are planning to commit these two things together or
incrementally :
1. Expand in bound order
2. Allow for locking only the partitioned tables first.For #1, the changes in pg_inherits.c are not required (viz, keeping
partitioned tables at the head of the list, adding inhchildparted
column, etc).
Yes. Changes to pg_inherits.c can be committed completely independently
of either 1 or 2, although then there would be nobody using that capability.
About 2: I think the capability to lock only the partitioned tables in
expand_inherited_rtentry() will only be used once we have the patch to do
the necessary planner restructuring that will allow us to defer child
table locking to some place that is not expand_inherited_rtentry(). I am
working on that patch now and should be able to show something soon.
If we are going to do #2 together with #1, then in the patch set there
is no one using the capability introduced by #2. That is, there are no
callers of find_all_inheritors() that make use of the new
num_partitioned_children parameter. Also, there is no boolean
parameter for find_all_inheritors() to be used to lock only the
partitioned tables.I feel we should think about
0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch later, and
first get the review done for the other patches.
I think that makes sense.
I see that RelationGetPartitionDispatchInfo() now returns quite a
small subset of what it used to return, which is good. But I feel for
expand_inherited_rtentry(), RelationGetPartitionDispatchInfo() is
still a bit heavy. We only require the oids, so the
PartitionedTableInfo data structure (including the pd->indexes array)
gets discarded.
Maybe we could make the output argument optional, but I don't see much
point in being too conservative here. Building the indexes array does not
cost us that much and if a not-too-distant-in-future patch could use that
information somehow, it could do so for free.
Also, RelationGetPartitionDispatchInfo() has to call get_rel_relkind()
for each child. In expand_inherited_rtentry(), we anyway have to open
all the child tables, so we get the partition descriptors for each of
the children for free. So how about, in expand_inherited_rtentry(), we
traverse the partition tree using these descriptors similar to how it
is traversed in RelationGetPartitionDispatchInfo() ? May be to avoid
code duplication for traversing, we can have a common API.
As mentioned, one goal I'm seeking is to avoid having to open the child
tables in expand_inherited_rtentry().
Thanks,
Amit
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers