Partitioned tables and relfilenode
The new partitioned tables do not contain any data by themselves. Any
data inserted into a partitioned table is routed to and stored in one of
its partitions. In fact, it is impossible to insert *any* data before a
partition (to be precise, a leaf partition) is created. It seems wasteful
then to allocate physical storage (files) for partitioned tables. If we
do not allocate the storage, then we must make sure that the right thing
happens when a command that is intended to manipulate a table's storage
encounters a partitioned table, the "right thing" here being that the
command's code either throws an error or warning (in some cases) if the
specified table is a partitioned table or ignores any partitioned tables
when it reads the list of relations to process from pg_class. Commands
that need to be taught about this are vacuum, analyze, truncate, and alter
table. Specifically:
- In case of vacuum, specifying a partitioned table causes a warning
- In case of analyze, we do not throw an error or warning but simply
avoid calling do_analyze_rel() *non-recursively*. Further in
acquire_inherited_sample_rows(), any partitioned tables in the list
returned by find_all_inheritors() are skipped.
- In case of truncate, only the part which manipulates table's physical
storage is skipped for partitioned tables.
- ATRewriteTables() skips on the AlteredTableInfo entries for partitioned
tables, because there is nothing to be done.
- Since we cannot create indexes on partitioned tables anyway, there is
no need to handle cluster and reindex (they throw a meaningful error
already due to the lack of indexes.)
Patches 0001 and 0002 of the attached implement the above part. 0001
teaches the above mentioned commands to do the "right thing" as described
above and 0002 teaches heap_create() and heap_create_with_catalog() to not
create any physical storage (none of the forks) for partitioned tables.
Then comes 0003, which concerns inheritance planning. In a regular
inheritance set (ie, the inheritance set corresponding to an inheritance
hierarchy whose root is a regular table), the inheritance parents are
included in their role as child members, because they might contribute
rows to the final result. So AppendRelInfo's are created for each such
table by the planner prep phase, which the later planning steps use to
create a scan plan for those tables as the Append's child plans.
Currently, the partitioned tables are also processed by the optimizer as
inheritance sets. Partitioned table inheritance parents do not own any
storage, so we *must* not create scan plans for them. So we do not need
to process them as child members of the inheritance set. 0003 teaches
expand_inherited_rtentry() to not add partitioned tables as child members.
Also, since the root partitioned table RTE is no longer added to the
Append list as the 1st child member, inheritance_planner() cannot assume
that it can install the 1st child RTE as the nominalRelation of a given
ModifyTable node, instead the original root parent table RTE is installed
as the nominalRelation.
Together the above patches implement the first item listed in "Some
notes:" part of an email [1]/messages/by-id/CA+Tgmob6DJEQBd=qH2wZG1onYZ9R1Sji3q+GqCRUqQi+ug28Rw@mail.gmail.com on the original declarative partitioning
thread, which says:
"We should try to teach the executor never to scan the parent. That's
never necessary with this system, and it might add significant overhead.
We should also try to get rid of the idea of the parent having storage
(i.e. a relfilenode)."
Thoughts, comments?
Thanks,
Amit
[1]: /messages/by-id/CA+Tgmob6DJEQBd=qH2wZG1onYZ9R1Sji3q+GqCRUqQi+ug28Rw@mail.gmail.com
/messages/by-id/CA+Tgmob6DJEQBd=qH2wZG1onYZ9R1Sji3q+GqCRUqQi+ug28Rw@mail.gmail.com
Attachments:
0001-Partitioned-tables-are-empty-themselves.patchtext/x-diff; name=0001-Partitioned-tables-are-empty-themselves.patchDownload+48-21
0002-Do-not-allocate-storage-for-partitioned-tables.patchtext/x-diff; name=0002-Do-not-allocate-storage-for-partitioned-tables.patchDownload+7-2
0003-Always-plan-partitioned-tables-as-inheritance-sets.patchtext/x-diff; name=0003-Always-plan-partitioned-tables-as-inheritance-sets.patchDownload+66-87
On 10 February 2017 at 06:19, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
The new partitioned tables do not contain any data by themselves. Any
data inserted into a partitioned table is routed to and stored in one of
its partitions. In fact, it is impossible to insert *any* data before a
partition (to be precise, a leaf partition) is created.
Where is all of this documented? All I see at the moment is old docs.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2017/02/10 15:58, Simon Riggs wrote:
On 10 February 2017 at 06:19, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:The new partitioned tables do not contain any data by themselves. Any
data inserted into a partitioned table is routed to and stored in one of
its partitions. In fact, it is impossible to insert *any* data before a
partition (to be precise, a leaf partition) is created.Where is all of this documented? All I see at the moment is old docs.
Last week I sent documentation patches in a separate message titled
"Documentation improvements for partitioning" [1]/messages/by-id/bf9ebbb3-fb1e-3206-b67c-e7a803a747d9@lab.ntt.co.jp. One of the patches
adds a separate section in the DDL chapter describing partitioned tables.
Thanks,
Amit
[1]: /messages/by-id/bf9ebbb3-fb1e-3206-b67c-e7a803a747d9@lab.ntt.co.jp
/messages/by-id/bf9ebbb3-fb1e-3206-b67c-e7a803a747d9@lab.ntt.co.jp
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Feb 10, 2017 at 3:19 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
The new partitioned tables do not contain any data by themselves. Any
data inserted into a partitioned table is routed to and stored in one of
its partitions. In fact, it is impossible to insert *any* data before a
partition (to be precise, a leaf partition) is created. It seems wasteful
then to allocate physical storage (files) for partitioned tables. If we
do not allocate the storage, then we must make sure that the right thing
happens when a command that is intended to manipulate a table's storage
encounters a partitioned table, the "right thing" here being that the
command's code either throws an error or warning (in some cases) if the
specified table is a partitioned table or ignores any partitioned tables
when it reads the list of relations to process from pg_class. Commands
that need to be taught about this are vacuum, analyze, truncate, and alter
table. Specifically:
Thanks. I have been looking a bit at this set of patches.
- In case of vacuum, specifying a partitioned table causes a warning
- In case of analyze, we do not throw an error or warning but simply
avoid calling do_analyze_rel() *non-recursively*. Further in
acquire_inherited_sample_rows(), any partitioned tables in the list
returned by find_all_inheritors() are skipped.
- In case of truncate, only the part which manipulates table's physical
storage is skipped for partitioned tables.
I am wondering if instead those should not just be errors saying that
such operations are just not support. This could be relaxed in the
future to mean that a vacuum, truncate or analyze on the partitioned
table triggers an operator in cascade.
- ATRewriteTables() skips on the AlteredTableInfo entries for partitioned
tables, because there is nothing to be done.
Perhaps that makes sense, foreign tables do that.
- Since we cannot create indexes on partitioned tables anyway, there is
no need to handle cluster and reindex (they throw a meaningful error
already due to the lack of indexes.)
Yep.
Patches 0001 and 0002 of the attached implement the above part. 0001
teaches the above mentioned commands to do the "right thing" as described
above and 0002 teaches heap_create() and heap_create_with_catalog() to not
create any physical storage (none of the forks) for partitioned tables.
- else
+ /*
+ * Although we cannot analyze partitioned tables themselves, we are
+ * still be able to do the recursive ANALYZE.
+ */
+ else if (onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
{
/* No need for a WARNING if we already complained during VACUUM */
if (!(options & VACOPT_VACUUM))
It seems to me that it is better style to use an empty else if with
only the comment. Comments related to a condition that are outside
this condition should be conditional in their formulations. Comments
within a condition can be affirmations when they refer to a decided
state of things.
From patch 2:
@@ -1351,7 +1352,12 @@ heap_create_with_catalog(const char *relname,
relkind == RELKIND_TOASTVALUE ||
relkind == RELKIND_PARTITIONED_TABLE);
- heap_create_init_fork(new_rel_desc);
+ /*
+ * We do not want to create any storage objects for a partitioned
+ * table.
+ */
+ if (relkind != RELKIND_PARTITIONED_TABLE)
+ heap_create_init_fork(new_rel_desc);
}
This does not make much sense with an assertion telling exactly the
contrary a couple of lines before. I think that it would make more
sense to move the assertion on relkind directly in
heap_create_init_fork().
Then comes 0003, which concerns inheritance planning. In a regular
inheritance set (ie, the inheritance set corresponding to an inheritance
hierarchy whose root is a regular table), the inheritance parents are
included in their role as child members, because they might contribute
rows to the final result. So AppendRelInfo's are created for each such
table by the planner prep phase, which the later planning steps use to
create a scan plan for those tables as the Append's child plans.
Currently, the partitioned tables are also processed by the optimizer as
inheritance sets. Partitioned table inheritance parents do not own any
storage, so we *must* not create scan plans for them. So we do not need
to process them as child members of the inheritance set. 0003 teaches
expand_inherited_rtentry() to not add partitioned tables as child members.
Also, since the root partitioned table RTE is no longer added to the
Append list as the 1st child member, inheritance_planner() cannot assume
that it can install the 1st child RTE as the nominalRelation of a given
ModifyTable node, instead the original root parent table RTE is installed
as the nominalRelation.
This is a collection of checks on relkind == RELKIND_PARTITIONED_TABLE
to avoid interactions with partition tables. Did you consider doing
something in the executor instead?
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Feb 15, 2017 at 2:14 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
This is a collection of checks on relkind == RELKIND_PARTITIONED_TABLE
to avoid interactions with partition tables. Did you consider doing
something in the executor instead?
That seems inferior, because the planner really ought to know that the
partitioning root doesn't need to be scanned. That way, it can do a
better job getting the cost and selectivity estimates right.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 10 February 2017 at 06:19, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
the "right thing" here being that the
command's code either throws an error or warning (in some cases) if the
specified table is a partitioned table or ignores any partitioned tables
when it reads the list of relations to process from pg_class.
This is a massive assumption and deserves major discussion.
My expectation is that "partitioned tables" are "tables". Anything
else seems to fly in the face of both the SQL Standard and the POLA
principle for users coming from other database systems.
IMHO all the main actions should all "just work" not throw errors.
Commands
that need to be taught about this are vacuum, analyze, truncate, and alter
table. Specifically:- In case of vacuum, specifying a partitioned table causes a warning
Why not vacuum all partitions?
- In case of analyze, we do not throw an error or warning but simply
avoid calling do_analyze_rel() *non-recursively*. Further in
acquire_inherited_sample_rows(), any partitioned tables in the list
returned by find_all_inheritors() are skipped.
Why not analyze all partitions?
- In case of truncate, only the part which manipulates table's physical
storage is skipped for partitioned tables.
Truncate all partitions
- ATRewriteTables() skips on the AlteredTableInfo entries for partitioned
tables, because there is nothing to be done.- Since we cannot create indexes on partitioned tables anyway, there is
no need to handle cluster and reindex (they throw a meaningful error
already due to the lack of indexes.)
Create index on all partitions
Thoughts, comments?
We already have Inheritance. Anybody that likes that behaviour can use it.
Most people don't like that behaviour and wish to see change. They
want the same behaviour as they get from other database products where
a partitioned table is a table and you can do stuff to it just like
other tables. We have the opportunity to change things here and we
should do so.
(It also seems like wasted effort to try to remove the overhead caused
by a parent table for partitioning. Why introduce a special case to
save a few bytes? Premature optimization, surely?)
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Feb 16, 2017 at 6:32 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
Why not vacuum all partitions?
Why not analyze all partitions?
Truncate all partitions
I agree. But, we need to be careful that a database-wide VACUUM or
ANALYZE doesn't hit the partitions multiple times, once for the parent
and again for each child. Actually, a database-wide VACUUM should hit
each partition individually and do nothing for the parents, but a
database-wide ANALYZE should process the parents and do nothing for
the children, so that the inheritance statistics get updated.
- ATRewriteTables() skips on the AlteredTableInfo entries for partitioned
tables, because there is nothing to be done.- Since we cannot create indexes on partitioned tables anyway, there is
no need to handle cluster and reindex (they throw a meaningful error
already due to the lack of indexes.)Create index on all partitions
That one's more complicated, per what I wrote in
/messages/by-id/CA+TgmoZUwj=QYnaK+F7xEf4w_e2g3XxdMnSNZMZjuinHRcOB8A@mail.gmail.com
(It also seems like wasted effort to try to remove the overhead caused
by a parent table for partitioning. Why introduce a special case to
save a few bytes? Premature optimization, surely?)
I don't think it's wasted effort, actually. My concern isn't so much
the empty file on disk (which is stupid, but maybe harmless) as
eliminating the dummy scan from the query plan. I believe the
do-nothing scan can actually be a noticeable drag on performance in
some cases - e.g. if the scan of the partitioned table is on the
inside of a nested loop, so that instead of repeatedly doing an index
scan on each of 4 partitions, you repeatedly do an index scan on each
of 4 partitions and a sequential scan of an empty table. A zero-page
sequential scan is pretty fast, but not free. An even bigger problem
is that the planner may think that always-empty parent can contain
some rows, throwing planner estimates off and messing up the whole
plan. We've been living with that problem for a long time, but now
that we have an opportunity to fix it, it would be good to do so.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2017/02/15 16:14, Michael Paquier wrote:
On Fri, Feb 10, 2017 at 3:19 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:The new partitioned tables do not contain any data by themselves. Any
data inserted into a partitioned table is routed to and stored in one of
its partitions. In fact, it is impossible to insert *any* data before a
partition (to be precise, a leaf partition) is created. It seems wasteful
then to allocate physical storage (files) for partitioned tables. If we
do not allocate the storage, then we must make sure that the right thing
happens when a command that is intended to manipulate a table's storage
encounters a partitioned table, the "right thing" here being that the
command's code either throws an error or warning (in some cases) if the
specified table is a partitioned table or ignores any partitioned tables
when it reads the list of relations to process from pg_class. Commands
that need to be taught about this are vacuum, analyze, truncate, and alter
table. Specifically:Thanks. I have been looking a bit at this set of patches.
Thanks for reviewing!
- In case of vacuum, specifying a partitioned table causes a warning
- In case of analyze, we do not throw an error or warning but simply
avoid calling do_analyze_rel() *non-recursively*. Further in
acquire_inherited_sample_rows(), any partitioned tables in the list
returned by find_all_inheritors() are skipped.
- In case of truncate, only the part which manipulates table's physical
storage is skipped for partitioned tables.I am wondering if instead those should not just be errors saying that
such operations are just not support. This could be relaxed in the
future to mean that a vacuum, truncate or analyze on the partitioned
table triggers an operator in cascade.
As the discussion down-thread suggests, that might be a path worth
considering. That is, vacuum or analyze on a partitioned table causes all
the (leaf) partitions to be vacuumed or analyzed.
Truncate already does that (that is, recurse to leaf partitions). This
patch simply prevents ExecuteTruncate() from trying to manipulate the
partitioned table's relfilenode.
- Since we cannot create indexes on partitioned tables anyway, there is
no need to handle cluster and reindex (they throw a meaningful error
already due to the lack of indexes.)Yep.
This one is still under discussion.
Patches 0001 and 0002 of the attached implement the above part. 0001
teaches the above mentioned commands to do the "right thing" as described
above and 0002 teaches heap_create() and heap_create_with_catalog() to not
create any physical storage (none of the forks) for partitioned tables.- else + /* + * Although we cannot analyze partitioned tables themselves, we are + * still be able to do the recursive ANALYZE. + */ + else if (onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE) { /* No need for a WARNING if we already complained during VACUUM */ if (!(options & VACOPT_VACUUM)) It seems to me that it is better style to use an empty else if with only the comment. Comments related to a condition that are outside this condition should be conditional in their formulations. Comments within a condition can be affirmations when they refer to a decided state of things.
Not sure if this will survive based on the decisions on what to do about
vacuum and analyze, but how about the following rewrite of the above:
else if (onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
/*
* Although we cannot analyze partitioned tables themselves, we are
* still be able to do the recursive ANALYZE, which we do below.
*/
}
else
{
From patch 2:
@@ -1351,7 +1352,12 @@ heap_create_with_catalog(const char *relname,
relkind == RELKIND_TOASTVALUE ||
relkind == RELKIND_PARTITIONED_TABLE);- heap_create_init_fork(new_rel_desc); + /* + * We do not want to create any storage objects for a partitioned + * table. + */ + if (relkind != RELKIND_PARTITIONED_TABLE) + heap_create_init_fork(new_rel_desc); } This does not make much sense with an assertion telling exactly the contrary a couple of lines before. I think that it would make more sense to move the assertion on relkind directly in heap_create_init_fork().
Hmm, perhaps the below is saner, with the Assert moved to the start of
heap_create_init_fork() as shown below:
/*
* We do not want to create any storage objects for a partitioned
* table, including the init fork.
*/
if (relpersistence == RELPERSISTENCE_UNLOGGED &&
relkind != RELKIND_PARTITIONED_TABLE)
heap_create_init_fork(new_rel_desc);
@@ -1382,6 +1376,8 @@ heap_create_with_catalog(const char *relname,
void
heap_create_init_fork(Relation rel)
{
+ Assert(relkind == RELKIND_RELATION || relkind == RELKIND_MATVIEW ||
+ relkind == RELKIND_TOASTVALUE);
Then comes 0003, which concerns inheritance planning. In a regular
inheritance set (ie, the inheritance set corresponding to an inheritance
hierarchy whose root is a regular table), the inheritance parents are
included in their role as child members, because they might contribute
rows to the final result. So AppendRelInfo's are created for each such
table by the planner prep phase, which the later planning steps use to
create a scan plan for those tables as the Append's child plans.
Currently, the partitioned tables are also processed by the optimizer as
inheritance sets. Partitioned table inheritance parents do not own any
storage, so we *must* not create scan plans for them. So we do not need
to process them as child members of the inheritance set. 0003 teaches
expand_inherited_rtentry() to not add partitioned tables as child members.
Also, since the root partitioned table RTE is no longer added to the
Append list as the 1st child member, inheritance_planner() cannot assume
that it can install the 1st child RTE as the nominalRelation of a given
ModifyTable node, instead the original root parent table RTE is installed
as the nominalRelation.This is a collection of checks on relkind == RELKIND_PARTITIONED_TABLE
to avoid interactions with partition tables. Did you consider doing
something in the executor instead?
Getting rid of the parent relation (of which there could be many if a
given partition tree is multi-level) well within the optimizer means that
the optimizer makes a plan that does not have redundant Scan nodes for
partitioned tables (and if the 2nd patch is worthy of consideration, we
absolutely cannot create Scan nodes for them at all). And Robert's point
about optimizer's cost and selectivity estimations.
I will post the updated patches after addressing other comments in the thread.
Thanks,
Amit
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2017/02/16 23:40, Robert Haas wrote:
On Thu, Feb 16, 2017 at 6:32 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
Why not vacuum all partitions?
Why not analyze all partitions?
Truncate all partitionsI agree. But, we need to be careful that a database-wide VACUUM or
ANALYZE doesn't hit the partitions multiple times, once for the parent
and again for each child. Actually, a database-wide VACUUM should hit
each partition individually and do nothing for the parents, but a
This is what would happen even without the patch. Patch only modifies
what happens when a partitioned table is specified in the vacuum command.
It emits a warning:
WARNING: skipping "%s" --- cannot vacuum partitioned tables
It seems both you and Simon agree that instead of this warning, we should
recurse to process the leaf partitions (ignoring any partitioned tables in
the hierarchy for which there is nothing to do). If that's right, I will
modify the patch to do that.
database-wide ANALYZE should process the parents and do nothing for
the children, so that the inheritance statistics get updated.
Currently vacuum() processes the following relations:
/*
* Process all plain relations and materialized views listed in
* pg_class
*/
while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
{
Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple);
if (classForm->relkind != RELKIND_RELATION &&
classForm->relkind != RELKIND_MATVIEW)
continue;
Do you mean that if database-wide analyze is to be run, we should also
exclude those RELKIND_RELATION relations that are partitions?
So the only way to update a partition's statistics is to directly specify
it in the command or by autovacuum.
Truncate already recurses to partitions by way of inheritance recursion
that's already in place. The patch simply teaches ExecuteTruncate() to
ignore partitioned tables when we get to the part where relfilenodes are
manipulated.
- ATRewriteTables() skips on the AlteredTableInfo entries for partitioned
tables, because there is nothing to be done.- Since we cannot create indexes on partitioned tables anyway, there is
no need to handle cluster and reindex (they throw a meaningful error
already due to the lack of indexes.)Create index on all partitions
That one's more complicated, per what I wrote in
/messages/by-id/CA+TgmoZUwj=QYnaK+F7xEf4w_e2g3XxdMnSNZMZjuinHRcOB8A@mail.gmail.com
I agree that it would be nice to fix this usability issue. But, it's also
important as you say in the quoted email that we should not underestimate
the amount of work that might be required to properly implement it.
(It also seems like wasted effort to try to remove the overhead caused
by a parent table for partitioning. Why introduce a special case to
save a few bytes? Premature optimization, surely?)I don't think it's wasted effort, actually. My concern isn't so much
the empty file on disk (which is stupid, but maybe harmless) as
eliminating the dummy scan from the query plan. I believe the
do-nothing scan can actually be a noticeable drag on performance in
some cases - e.g. if the scan of the partitioned table is on the
inside of a nested loop, so that instead of repeatedly doing an index
scan on each of 4 partitions, you repeatedly do an index scan on each
of 4 partitions and a sequential scan of an empty table. A zero-page
sequential scan is pretty fast, but not free. An even bigger problem
is that the planner may think that always-empty parent can contain
some rows, throwing planner estimates off and messing up the whole
plan. We've been living with that problem for a long time, but now
that we have an opportunity to fix it, it would be good to do so.
Agreed. I would have reconsidered if it were a more invasive patch.
Thanks,
Amit
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Feb 17, 2017 at 1:12 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
I agree. But, we need to be careful that a database-wide VACUUM or
ANALYZE doesn't hit the partitions multiple times, once for the parent
and again for each child. Actually, a database-wide VACUUM should hit
each partition individually and do nothing for the parents, but aThis is what would happen even without the patch. Patch only modifies
what happens when a partitioned table is specified in the vacuum command.
It emits a warning:WARNING: skipping "%s" --- cannot vacuum partitioned tables
It seems both you and Simon agree that instead of this warning, we should
recurse to process the leaf partitions (ignoring any partitioned tables in
the hierarchy for which there is nothing to do). If that's right, I will
modify the patch to do that.
Yeah, that sounds fine.
database-wide ANALYZE should process the parents and do nothing for
the children, so that the inheritance statistics get updated.Currently vacuum() processes the following relations:
/*
* Process all plain relations and materialized views listed in
* pg_class
*/while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
{
Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple);if (classForm->relkind != RELKIND_RELATION &&
classForm->relkind != RELKIND_MATVIEW)
continue;Do you mean that if database-wide analyze is to be run, we should also
exclude those RELKIND_RELATION relations that are partitions?So the only way to update a partition's statistics is to directly specify
it in the command or by autovacuum.
I think if you type:
ANALYZE;
...that should process all partitioned tables and all tables that are
not themselves partitions. If you type:
ANALYZE name;
...that should ANALYZE that relation, whatever it is. If it's a
partitioned table, it should recurse.
Truncate already recurses to partitions by way of inheritance recursion
that's already in place. The patch simply teaches ExecuteTruncate() to
ignore partitioned tables when we get to the part where relfilenodes are
manipulated.
Oh, OK. That seems fine.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 16 February 2017 at 11:32, Simon Riggs <simon@2ndquadrant.com> wrote:
On 10 February 2017 at 06:19, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:the "right thing" here being that the
command's code either throws an error or warning (in some cases) if the
specified table is a partitioned table or ignores any partitioned tables
when it reads the list of relations to process from pg_class.This is a massive assumption and deserves major discussion.
My expectation is that "partitioned tables" are "tables". Anything
else seems to fly in the face of both the SQL Standard and the POLA
principle for users coming from other database systems.IMHO all the main actions should all "just work" not throw errors.
This included DROP TABLE, which I commented on before.
CASCADE should not be required.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2017/02/20 5:31, Simon Riggs wrote:
On 16 February 2017 at 11:32, Simon Riggs <simon@2ndquadrant.com> wrote:
On 10 February 2017 at 06:19, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:the "right thing" here being that the
command's code either throws an error or warning (in some cases) if the
specified table is a partitioned table or ignores any partitioned tables
when it reads the list of relations to process from pg_class.This is a massive assumption and deserves major discussion.
My expectation is that "partitioned tables" are "tables". Anything
else seems to fly in the face of both the SQL Standard and the POLA
principle for users coming from other database systems.IMHO all the main actions should all "just work" not throw errors.
This included DROP TABLE, which I commented on before.
CASCADE should not be required.
Yeah, it seemed like that is the consensus so I posted a patch [0]/messages/by-id/ca132b99-0d18-439a-fe65-024085449259@lab.ntt.co.jp, which
re-posted in a new thread titled "dropping partitioned tables without
CASCADE" [1]/messages/by-id/6c420206-45d7-3f56-8325-4bd7b76483ba@lab.ntt.co.jp.
Thanks,
Amit
[0]: /messages/by-id/ca132b99-0d18-439a-fe65-024085449259@lab.ntt.co.jp
[1]: /messages/by-id/6c420206-45d7-3f56-8325-4bd7b76483ba@lab.ntt.co.jp
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2017/02/19 18:53, Robert Haas wrote:
On Fri, Feb 17, 2017 at 1:12 PM, Amit Langote wrote:
Do you mean that if database-wide analyze is to be run, we should also
exclude those RELKIND_RELATION relations that are partitions?So the only way to update a partition's statistics is to directly specify
it in the command or by autovacuum.I think if you type:
ANALYZE;
...that should process all partitioned tables and all tables that are
not themselves partitions.
OK.
If you type:
ANALYZE name;
...that should ANALYZE that relation, whatever it is. If it's a
partitioned table, it should recurse.
To be clear, by "recurse" I assume you mean to perform ANALYZE on
individual partitions, not just collect the inheritance statistics. So
ANALYZE partitioned_table would both a) collect the inheritance statistics
for the specified table and other partitioned tables in the hierarchy, b)
ANALYZE every leaf partitions updating their statistics in pg_class.
While working on this, I noticed that autovacuum.c does not collect
RELKIND_PARTITIONED_TABLE relations, which I think is not right. It
should match what get_rel_oids() does, which in the database-wide
VACUUM/ANALYZE case collects them.
Attached updated patches:
Updated 0001 addresses the following comments:
- should recurse when vacuum/analyze is performed on a partitioned table
- database-wide vacuum should ignore partitioned tables
- database-wide analyze should ignore partitions; only the inheritance
statistics of the partitioned tables must be collected in this case
Thanks,
Amit
Attachments:
0001-Partitioned-tables-are-empty-themselves.patchtext/x-diff; name=0001-Partitioned-tables-are-empty-themselves.patchDownload+118-28
0002-Do-not-allocate-storage-for-partitioned-tables.patchtext/x-diff; name=0002-Do-not-allocate-storage-for-partitioned-tables.patchDownload+11-10
0003-Always-plan-partitioned-tables-as-inheritance-sets.patchtext/x-diff; name=0003-Always-plan-partitioned-tables-as-inheritance-sets.patchDownload+68-87
Some comments about 0003 patch.
@@ -996,10 +996,20 @@ inheritance_planner(PlannerInfo *root)
Index rti;
+ RangeTblEntry *parent_rte;
There's already another variable declared in that function within a loop
foreach(lc, root->append_rel_list)
{
...
RangeTblEntry *parent_rte;
RangeTblEntry *child_rte;
You might want to choose a different name or delete the one within the loop.
I am wondering whether we should deal with inh flat reset in a
slightly different way. Let expand_inherited_rtentry() mark inh =
false for the partitioned tables without any partitions and deal with
those at the time of estimating size by marking those as dummy. That
might be better than multiple changes here. I will try to cook up a
patch soon for that.
Also we should add tests to make sure the scans on partitioned tables
without any partitions do not get into problems. PFA patch which adds
those tests.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
0003-Always-plan-partitioned-tables-as-inheritance-sets_v2.patchapplication/octet-stream; name=0003-Always-plan-partitioned-tables-as-inheritance-sets_v2.patchDownload+92-86
On 2017/02/21 22:21, Ashutosh Bapat wrote:
Some comments about 0003 patch.
@@ -996,10 +996,20 @@ inheritance_planner(PlannerInfo *root)
Index rti;
+ RangeTblEntry *parent_rte;
There's already another variable declared in that function within a loop
foreach(lc, root->append_rel_list)
{
...
RangeTblEntry *parent_rte;
RangeTblEntry *child_rte;You might want to choose a different name or delete the one within the loop.
Deleted the one within the loop.
I am wondering whether we should deal with inh flat reset in a
slightly different way. Let expand_inherited_rtentry() mark inh =
false for the partitioned tables without any partitions and deal with
those at the time of estimating size by marking those as dummy. That
might be better than multiple changes here. I will try to cook up a
patch soon for that.
Are thinking something along the lines of the attached rewritten patch
0003? I also tend to think that's probably a cleaner patch. Thanks for
the idea.
Also we should add tests to make sure the scans on partitioned tables
without any partitions do not get into problems. PFA patch which adds
those tests.
I added the test case you suggest, but kept just the first one.
I am including the patches 0001 and 0002 to keep all patches being
discussed on this thread together.
Thanks,
Amit
Attachments:
0001-Partitioned-tables-are-empty-themselves.patchtext/x-diff; name=0001-Partitioned-tables-are-empty-themselves.patchDownload+123-30
0002-Do-not-allocate-storage-for-partitioned-tables.patchtext/x-diff; name=0002-Do-not-allocate-storage-for-partitioned-tables.patchDownload+11-10
0003-Avoid-creating-scan-nodes-for-partitioned-tables.patchtext/x-diff; name=0003-Avoid-creating-scan-nodes-for-partitioned-tables.patchDownload+76-84
I am wondering whether we should deal with inh flat reset in a
slightly different way. Let expand_inherited_rtentry() mark inh =
false for the partitioned tables without any partitions and deal with
those at the time of estimating size by marking those as dummy. That
might be better than multiple changes here. I will try to cook up a
patch soon for that.Are thinking something along the lines of the attached rewritten patch
0003? I also tend to think that's probably a cleaner patch. Thanks for
the idea.
Yes. Thanks for working on it.
Also we should add tests to make sure the scans on partitioned tables
without any partitions do not get into problems. PFA patch which adds
those tests.I added the test case you suggest, but kept just the first one.
The second one was testing multi-level partitioning case, which
deserves a testcase by its own.
Some more comments
The comment below seems too verbose. All it can say is "A partitioned table
without any partitions results in a dummy relation." I don't think we need to
be explicit about rte->inh. But it's more of personal preference. We will leave
it to the committer, if you don't agree.
+ /*
+ * An empty partitioned table, i.e., one without any leaf
+ * partitions, as signaled by rte->inh being set to false
+ * by the prep phase (see expand_inherited_rtentry).
+ */
We don't need this comment as well. Rather than repeating what's been said at
the top of the function, it should just refer to it like "nominal relation has
been set above for partitioned tables. For other parent relations, we'll use
the first child ...".
+ *
+ * If the parent is a partitioned table, we do not have a separate RTE
+ * representing it as a member of the inheritance set, because we do
+ * not create a scan plan for it. As mentioned at the top of this
+ * function, we make the parent RTE itself the nominal relation.
*/
Following condition is not very readable. It's not evident that it's of the
form (A && B) || C, at a glance it looks like it's A && (B || C).
+ if ((rte->relkind != RELKIND_PARTITIONED_TABLE &&
+ list_length(appinfos) < 2) || list_length(appinfos) < 1)
Instead you may rearrage it as
min_child_rels = (rte->relkind == RELKIND_PARTITIONED_TABLE ? 1 : 2);
if (list_length(appinfos) < min_child_rels)
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Thanks for the review.
On 2017/02/22 21:58, Ashutosh Bapat wrote:
Also we should add tests to make sure the scans on partitioned tables
without any partitions do not get into problems. PFA patch which adds
those tests.I added the test case you suggest, but kept just the first one.
The second one was testing multi-level partitioning case, which
deserves a testcase by its own.
Ah, okay. Added it back.
Some more comments
The comment below seems too verbose. All it can say is "A partitioned table without any partitions results in a dummy relation." I don't think we need to be explicit about rte->inh. But it's more of personal preference. We will leave it to the committer, if you don't agree. + /* + * An empty partitioned table, i.e., one without any leaf + * partitions, as signaled by rte->inh being set to false + * by the prep phase (see expand_inherited_rtentry). + */
It does seem poorly worded. How about:
/*
* A partitioned table without leaf partitions is marked
* as a dummy rel.
*/
We don't need this comment as well. Rather than repeating what's been said at the top of the function, it should just refer to it like "nominal relation has been set above for partitioned tables. For other parent relations, we'll use the first child ...". + * + * If the parent is a partitioned table, we do not have a separate RTE + * representing it as a member of the inheritance set, because we do + * not create a scan plan for it. As mentioned at the top of this + * function, we make the parent RTE itself the nominal relation. */
Rewrote that comment block as:
*
* If the parent is a partitioned table, we already set the nominal
* relation.
*/
Following condition is not very readable. It's not evident that it's of the form (A && B) || C, at a glance it looks like it's A && (B || C). + if ((rte->relkind != RELKIND_PARTITIONED_TABLE && + list_length(appinfos) < 2) || list_length(appinfos) < 1)Instead you may rearrage it as
min_child_rels = (rte->relkind == RELKIND_PARTITIONED_TABLE ? 1 : 2);
if (list_length(appinfos) < min_child_rels)
OK, done that way.
Thanks,
Amit
Attachments:
0001-Partitioned-tables-are-empty-themselves.patchtext/x-diff; name=0001-Partitioned-tables-are-empty-themselves.patchDownload+123-30
0002-Do-not-allocate-storage-for-partitioned-tables.patchtext/x-diff; name=0002-Do-not-allocate-storage-for-partitioned-tables.patchDownload+11-10
0003-Avoid-creating-scan-nodes-for-partitioned-tables.patchtext/x-diff; name=0003-Avoid-creating-scan-nodes-for-partitioned-tables.patchDownload+90-84
On Thu, Feb 23, 2017 at 11:19 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
Thanks for the review.
On 2017/02/22 21:58, Ashutosh Bapat wrote:
Also we should add tests to make sure the scans on partitioned tables
without any partitions do not get into problems. PFA patch which adds
those tests.I added the test case you suggest, but kept just the first one.
The second one was testing multi-level partitioning case, which
deserves a testcase by its own.Ah, okay. Added it back.
Thanks.
Some more comments
The comment below seems too verbose. All it can say is "A partitioned table without any partitions results in a dummy relation." I don't think we need to be explicit about rte->inh. But it's more of personal preference. We will leave it to the committer, if you don't agree. + /* + * An empty partitioned table, i.e., one without any leaf + * partitions, as signaled by rte->inh being set to false + * by the prep phase (see expand_inherited_rtentry). + */It does seem poorly worded. How about:
/*
* A partitioned table without leaf partitions is marked
* as a dummy rel.
*/We don't need this comment as well. Rather than repeating what's been said at the top of the function, it should just refer to it like "nominal relation has been set above for partitioned tables. For other parent relations, we'll use the first child ...". + * + * If the parent is a partitioned table, we do not have a separate RTE + * representing it as a member of the inheritance set, because we do + * not create a scan plan for it. As mentioned at the top of this + * function, we make the parent RTE itself the nominal relation. */Rewrote that comment block as:
*
* If the parent is a partitioned table, we already set the nominal
* relation.
*/
I reworded those comments a bit and corrected grammar. Please check in
the attached patch.
Following condition is not very readable. It's not evident that it's of the form (A && B) || C, at a glance it looks like it's A && (B || C). + if ((rte->relkind != RELKIND_PARTITIONED_TABLE && + list_length(appinfos) < 2) || list_length(appinfos) < 1)Instead you may rearrage it as
min_child_rels = (rte->relkind == RELKIND_PARTITIONED_TABLE ? 1 : 2);
if (list_length(appinfos) < min_child_rels)OK, done that way.
On a second thought, I added a boolean to check if there were any
children created and then reset rte->inh based on that value. That's
better than relying on appinfos length now.
Let me know what you think.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
0003-Avoid-creating-scan-nodes-for-partitioned-tables_v2.patchapplication/octet-stream; name=0003-Avoid-creating-scan-nodes-for-partitioned-tables_v2.patchDownload+96-86
Thanks for the review.
On 2017/02/23 15:44, Ashutosh Bapat wrote:
On Thu, Feb 23, 2017 at 11:19 AM, Amit Langote wrote:
Rewrote that comment block as:
*
* If the parent is a partitioned table, we already set the nominal
* relation.
*/I reworded those comments a bit and corrected grammar. Please check in
the attached patch.
What was there sounds grammatically correct to me, but fine.
Following condition is not very readable. It's not evident that it's of the form (A && B) || C, at a glance it looks like it's A && (B || C). + if ((rte->relkind != RELKIND_PARTITIONED_TABLE && + list_length(appinfos) < 2) || list_length(appinfos) < 1)Instead you may rearrage it as
min_child_rels = (rte->relkind == RELKIND_PARTITIONED_TABLE ? 1 : 2);
if (list_length(appinfos) < min_child_rels)OK, done that way.
On a second thought, I added a boolean to check if there were any
children created and then reset rte->inh based on that value. That's
better than relying on appinfos length now.
@@ -996,10 +996,20 @@ inheritance_planner(PlannerInfo *root)
/*
+ * Partitioned tables do not have storage for themselves and should not be
+ * scanned.
@@ -1450,6 +1451,21 @@ expand_inherited_rtentry(PlannerInfo *root,
RangeTblEntry *rte, Index rti)
/*
+ * Partitioned tables themselves do not have any storage and should not
+ * be scanned. So, do not create child relations for those.
+ */
I guess we should not have to repeat "partitioned tables do not have
storage" in all these places.
+ * a partitioned relation as dummy. The duplicate RTE we added for the
+ * parent table is harmless, so we don't bother to get rid of it; ditto for
+ * the useless PlanRowMark node.
There is no duplicate RTE in the partitioned table case, which even my
original comment failed to consider. Can you, maybe?
Thanks,
Amit
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Feb 23, 2017 at 1:08 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
Thanks for the review.
On 2017/02/23 15:44, Ashutosh Bapat wrote:
On Thu, Feb 23, 2017 at 11:19 AM, Amit Langote wrote:
Rewrote that comment block as:
*
* If the parent is a partitioned table, we already set the nominal
* relation.
*/I reworded those comments a bit and corrected grammar. Please check in
the attached patch.What was there sounds grammatically correct to me, but fine.
Following condition is not very readable. It's not evident that it's of the form (A && B) || C, at a glance it looks like it's A && (B || C). + if ((rte->relkind != RELKIND_PARTITIONED_TABLE && + list_length(appinfos) < 2) || list_length(appinfos) < 1)Instead you may rearrage it as
min_child_rels = (rte->relkind == RELKIND_PARTITIONED_TABLE ? 1 : 2);
if (list_length(appinfos) < min_child_rels)OK, done that way.
On a second thought, I added a boolean to check if there were any
children created and then reset rte->inh based on that value. That's
better than relying on appinfos length now.@@ -996,10 +996,20 @@ inheritance_planner(PlannerInfo *root) /* + * Partitioned tables do not have storage for themselves and should not be + * scanned.@@ -1450,6 +1451,21 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti) /* + * Partitioned tables themselves do not have any storage and should not + * be scanned. So, do not create child relations for those. + */I guess we should not have to repeat "partitioned tables do not have
storage" in all these places.
Hmm, you are right. But they are two different functions and the code
blocks are located away from each other. So, I thought it would be
better to have complete comment there, instead of pointing to other
places. I am fine, if we can reword it without compromising
readability.
+ * a partitioned relation as dummy. The duplicate RTE we added for the + * parent table is harmless, so we don't bother to get rid of it; ditto for + * the useless PlanRowMark node.There is no duplicate RTE in the partitioned table case, which even my
original comment failed to consider. Can you, maybe?
May be we could says "If we have added duplicate RTE for the parent
table, it is harmless ...". Does that sound good?
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers