On partitioning
Prompted by a comment in the UPDATE/LIMIT thread, I saw Marko Tiikkaja
reference Tom's post
/messages/by-id/1598.1399826841@sss.pgh.pa.us
which mentions the possibility of a different partitioning
implementation than what we have so far. As it turns out, I've been
thinking about partitioning recently, so I thought I would share what
I'm thinking so that others can poke holes. My intention is to try to
implement this as soon as possible.
Declarative partitioning
========================
In this design, partitions are first-class objects, not normal tables in
inheritance hierarchies. There are no pg_inherits entries involved at all.
Partitions are a physical implementation detail. Therefore we do not allow
the owner to be changed, or permissions to be granted directly to partitions;
all these operations happen to the parent relation instead.
System Catalogs
---------------
In pg_class we have two additional relkind values:
* relkind RELKIND_PARTITIONED_REL 'P' indicates a partitioned relation.
It is used to indicate a parent table, i.e. one the user can directly
address in DML queries. Such relations DO NOT have their own storage.
These use the same rules as regular tables for access privileges,
ownership and so on.
* relkind RELKIND_PARTITION 'p' indicates a partition within a partitioned
relation (its parent). These cannot be addressed directly in DML
queries and only limited DDL support is provided. They don't have
their own pg_attribute entries either and therefore they are always
identical in column definitions to the parent relation. Since they
are not accessible directly, there is no need for ACL considerations;
the parent relation's owner is the owner, and grants are applied to
the parent relation only.
XXX --- is there a need for a partition having different column
default values than its parent relation?
Partitions are numbered sequentially, normally from 1 onwards; but it is
valid to have negative partition numbers and 0. Partitions don't have
names (except automatically generated ones for pg_class.relname, but
they are unusable in DDL).
Each partition is assigned an Expression that receives a tuple and
returns boolean. This expression returns true if a given tuple belongs
into it, false otherwise. If a tuple for a partitioned relation is run
through expressions of all partitions, exactly one should return true.
If none returns true, it might be because the partition has not been
created yet. A user-facing error is raised in this case (Rationale: if
user creates a partitioned rel and there is no partition that accepts
some given tuple, it's the user's fault.)
Additionally, each partitioned relation may have a master expression.
This receives a tuple and returns an integer, which corresponds to the
number of the partition it belongs into.
There are two new system catalogs:
pg_partitioned_rel --> (prelrelid, prelexpr) pg_partition -->
(partrelid, partseq, partexpr, partoverflow)
For partitioned rels that have prelexpr, we run that expression and
obtain the partition number; as a crosscheck we run partexpr and ensure
it returns true. For partitioned rels that don't have prelexpr, we run
partexpr for each partition in turn until one returns true. This means
that for a properly set up partitioned table, we need to run a single
expression on a tuple to find out what partition the tuple belongs into.
Per-partition expressions are formed as each partition is created, and
are based on the user-supplied partitioning criterion. Master
expressions are formed at relation creation time. (XXX Can we change
the master expression later, as a result of some ALTER command?
Presumably this would mean that all partitions might need to be
rewritten.)
Triggers --------
(These are user-defined triggers, not partitioning triggers. In fact
there are no partitioning triggers at all.)
Triggers are attached to the parent relation, not to the specific
partition. When a trigger function runs on a tuple inserted, updated or
modified on a partition, the data received by the trigger function makes
it appear that the tuple belongs to the parent relation. There is no
need to let the trigger know which partition the tuple went in or came
from. XXX is there a need to give it the partition number that the
tuple went it?
Syntax ------
CREATE TABLE xyz ( ... ) PARTITION BY RANGE ( a_expr ) This creates the
main table only: no partitions are created automatically.
We do not support other types of partitioning at this stage. We will
implement these later.
We do not currently support ALTER TABLE/PARTITION BY (i.e. partition a
table after the fact). We leave this as a future improvement.
Allowed actions on RELKIND_PARTITIONED_REL: * ALTER TABLE <xyz> CREATE
PARTITION <n> This creates a new partition * ALTER TABLE <xyz> CREATE
PARTITION FOR <value> Same as above; the partition number is determined
automatically.
Allowed actions on a RELKIND_PARTITION:
* ALTER PARTITION <n> ON TABLE <xyz> SET TABLESPACE * ALTER PARTITION
<n> ON TABLE <xyz> DROP * CREATE INDEX .. ON PARTITION <n> ON TABLE
<xyz> * VACUUM parent PARTITION <n>
As a future extension we will allow partitions to become detached from
the parent relation, thus becoming an independent table. This might be
a relatively expensive operation: pg_attribute entries need to be
created, for example.
Overflow Partitions -------------------
There is no explicit concept of overflow partitions.
Vacuum, aging -------------
PARTITIONED_RELs, not containing tuples directly, do not have
relfrozenxid or relminmxid. Each partition has individual values for
these variables.
Autovacuum knows to ignore PARTITIONED_RELs, and considers each
RELKIND_PARTITION separately.
Each partition is vacuumed as a normal relation.
Planner -------
A partitioned relation behaves just like a regular relation for purposes
of planner. XXX do we need special considerations regarding relation
size estimation?
For scan plans, we need to prepare Append lists which are used to scan
for tuples in a partitioned relation. We can setup fake constraint
expressions based on the partitioning expressions, which let the planner
discard unnecessary partitions by way of constraint exclusion.
(In the future we might be interested in creating specialized plan and
execution nodes that know more about partitioned relations, to avoid
creating useless Append trees only to prune them later.)
Executor --------
When doing an INSERT or UPDATE ResultRelInfo needs to be expanded for
partitioned relations: the target relation of an insertion is the parent
relation, but the actual partition needs to be resolved at ModifyTable
execution time. This means RelOptInfo needs to know about partitions;
either we deal with them as "other rels" terms, or we create a new
RelOptKind. At any rate, running the partitioning expression on the new
tuple would give an partition index. This needs to be done once for
each new tuple.
I think during ExecInsert, after running triggers and before executing
constraints, we need to switch resultRelationDesc from the parent
relation into the partition-specific relation.
ExecInsertIndexTuples only knows about partitions. It's an error to
call it using a partitioned rel.
Heap Access Method ------------------ For the purposes of low-level
routines in heapam.c, only partitions exist; trying to insert or modify
tuples in a RELKIND_PARTITIONED_REL is an error. heap_insert and
heap_multi_insert only accept inserting tuples into an individual
partition. These routines do not check that the tuples belong into the
specific partition; that's responsibility of higher-level code. Because
of this, code like COPY will need to make its own checks. Maybe we
should offer another API (in between high-level things such as
ModifyTable/COPY and heapam.c) that receives tuples into a
PARTITIONED_REL and routes them into specific partitions. Note: need to
ensure we do not slow down COPY for the regular case of
RELKIND_RELATION.
Taking backups --------------
pg_dump is able to dump a partitioned relation as a CREATE
TABLE/PARTITION command and a series of ALTER TABLE/CREATE PARTITION
commands. The data of all partitions is considered a single COPY
operation.
XXX this limits the ability to restore in parallel. To fix we might consider
using one COPY for each partition. It's not clear what relation should be
mentioned in such a COPY command, though -- my instinct is that it
should reference the parent table only, not the individual partition.
Previous Discussion
-------------------
/messages/by-id/d3c4af540703292358s8ed731el7771ab14083aa610@mail.gmail.com
Auto Partitioning Patch - WIP version 1
(Nikhil Sontakke, March 2007)
/messages/by-id/20080111231945.GY6934@europa.idg.com.au
Declarative partitioning grammar
(Gavin Sherry, January 2008)
/messages/by-id/bd8134a40906080702s96c90a9q3bbb581b9bd0d5d7@mail.gmail.com
Patch for automating partitions in PostgreSQL 8.4 Beta 2
(Kedar Potdar, Jun 2009)
/messages/by-id/20091029111531.96CD.52131E4D@oss.ntt.co.jp
Syntax for partitioning
(Itagaki Takahiro, Oct 2009)
/messages/by-id/AANLkTikP-1_8B04eyIK0sDf8uA5KMo64o8sorFBZE_CT@mail.gmail.com
Partitioning syntax
(Itagaki Takahiro, Jan 2010)
Not really related:
/messages/by-id/1199296574.7260.149.camel@ebony.site
Dynamic Partitioning using Segment Visibility Maps
(Simon Riggs, January 2008)
Still To Be Designed
--------------------
* Dependency issues
* Are indexes/constraints inherited from the parent rel?
* Multiple keys? Subpartitioning? Hash partitioning?
Open Questions
--------------
* What's the syntax to refer to specific partitions within a partitioned
table?
We could do "TABLE <xyz> PARTITION <n>", but for example if in
the future we add hash partitioning, we might need some non-integer
addressing (OTOH assigning sequential numbers to hash partitions doesn't
seem so bad). Discussing with users of other DBMSs partitioning feature,
one useful phrase is "TABLE <xyz> PARTITION FOR <value>".
* Do we want to provide partitioned materialized views?
--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
[ partition sketch ]
In this design, partitions are first-class objects, not normal tables in
inheritance hierarchies. There are no pg_inherits entries involved at all.
Hm, actually I'd say they are *not* first class objects; the problem with
the existing design is exactly that child tables *are* first class
objects. This is merely a terminology quibble though.
* relkind RELKIND_PARTITION 'p' indicates a partition within a partitioned
relation (its parent). These cannot be addressed directly in DML
queries and only limited DDL support is provided. They don't have
their own pg_attribute entries either and therefore they are always
identical in column definitions to the parent relation.
Not sure that not storing the pg_attribute rows is a good thing; but
that's something that won't be clear till you try to code it.
Each partition is assigned an Expression that receives a tuple and
returns boolean. This expression returns true if a given tuple belongs
into it, false otherwise.
-1, in fact minus a lot. One of the core problems of the current approach
is that the system, particularly the planner, hasn't got a lot of insight
into exactly what the partitioning scheme is in a partitioned table built
on inheritance. If you allow the partitioning rule to be a black box then
that doesn't get any better. I want to see a design wherein the system
understands *exactly* what the partitioning behavior is. I'd start with
supporting range-based partitioning explicitly, and maybe we could add
other behaviors such as hashing later.
In particular, there should never be any question at all that there is
exactly one partition that a given row belongs to, not more, not less.
You can't achieve that with a set of independent filter expressions;
a meta-rule that says "exactly one of them should return true" is an
untrustworthy band-aid.
(This does not preclude us from mapping the tuple through the partitioning
rule and finding that the corresponding partition doesn't currently exist.
I think we could view the partitioning rule as a function from tuples to
partition numbers, and then we look in pg_class to see if such a partition
exists.)
Additionally, each partitioned relation may have a master expression.
This receives a tuple and returns an integer, which corresponds to the
number of the partition it belongs into.
I guess this might be the same thing I'm arguing for, except that I say
it is not optional but is *the* way you define the partitioning. And
I don't really want black-box expressions even in this formulation.
If you're looking for arbitrary partitioning rules, you can keep on
using inheritance. The point of inventing partitioning, IMHO, is for
the system to have a lot more understanding of the behavior than is
possible now.
As an example of the point I'm trying to make, the planner should be able
to discard range-based partitions that are eliminated by a WHERE clause
with something a great deal cheaper than the theorem prover it currently
has to use for the purpose. Black-box partitioning rules not only don't
improve that situation, they actually make it worse.
Other than that, this sketch seems reasonable ...
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Aug 29, 2014 at 4:56 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
For scan plans, we need to prepare Append lists which are used to scan
for tuples in a partitioned relation. We can setup fake constraint
expressions based on the partitioning expressions, which let the planner
discard unnecessary partitions by way of constraint exclusion.(In the future we might be interested in creating specialized plan and
execution nodes that know more about partitioned relations, to avoid
creating useless Append trees only to prune them later.)
This seems like a big part of the point of doing first class
partitions. If we have an equivalence class that specifies a constant
for all the variables in the master expression then we should be able
to look up the corresponding partition as a O(1) operation (or
O(log(n) if it involves searching a list) rather than iterating over
all the partitions and trying to prove lots of exclusions. We might
even need a btree index to store the partitions so that we can handle
scaling up and still find the corresponding partitions quickly.
And I think there are still unanswered questions about indexes. You
seem to be implying that users would be free to create any index they
want on any partition. It's probably going to be necessary to support
creating an index on the partitioned table which would create an index
on each of the partitions and, crucially, automatically create
corresponding indexes whenever new partitions are added.
That said, everything that's here sounds pretty spot-on to me.
--
greg
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
2014-08-29 18:35 GMT+02:00 Tom Lane <tgl@sss.pgh.pa.us>:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
[ partition sketch ]
In this design, partitions are first-class objects, not normal tables in
inheritance hierarchies. There are no pg_inherits entries involved atall.
Hm, actually I'd say they are *not* first class objects; the problem with
the existing design is exactly that child tables *are* first class
objects. This is merely a terminology quibble though.
+1 .. only few partitions slowdown planning significantly from 1ms to 20ms,
what is a issue with very simple queries over PK
Show quoted text
* relkind RELKIND_PARTITION 'p' indicates a partition within a
partitioned
relation (its parent). These cannot be addressed directly in DML
queries and only limited DDL support is provided. They don't have
their own pg_attribute entries either and therefore they are always
identical in column definitions to the parent relation.Not sure that not storing the pg_attribute rows is a good thing; but
that's something that won't be clear till you try to code it.Each partition is assigned an Expression that receives a tuple and
returns boolean. This expression returns true if a given tuple belongs
into it, false otherwise.-1, in fact minus a lot. One of the core problems of the current approach
is that the system, particularly the planner, hasn't got a lot of insight
into exactly what the partitioning scheme is in a partitioned table built
on inheritance. If you allow the partitioning rule to be a black box then
that doesn't get any better. I want to see a design wherein the system
understands *exactly* what the partitioning behavior is. I'd start with
supporting range-based partitioning explicitly, and maybe we could add
other behaviors such as hashing later.In particular, there should never be any question at all that there is
exactly one partition that a given row belongs to, not more, not less.
You can't achieve that with a set of independent filter expressions;
a meta-rule that says "exactly one of them should return true" is an
untrustworthy band-aid.(This does not preclude us from mapping the tuple through the partitioning
rule and finding that the corresponding partition doesn't currently exist.
I think we could view the partitioning rule as a function from tuples to
partition numbers, and then we look in pg_class to see if such a partition
exists.)Additionally, each partitioned relation may have a master expression.
This receives a tuple and returns an integer, which corresponds to the
number of the partition it belongs into.I guess this might be the same thing I'm arguing for, except that I say
it is not optional but is *the* way you define the partitioning. And
I don't really want black-box expressions even in this formulation.
If you're looking for arbitrary partitioning rules, you can keep on
using inheritance. The point of inventing partitioning, IMHO, is for
the system to have a lot more understanding of the behavior than is
possible now.As an example of the point I'm trying to make, the planner should be able
to discard range-based partitions that are eliminated by a WHERE clause
with something a great deal cheaper than the theorem prover it currently
has to use for the purpose. Black-box partitioning rules not only don't
improve that situation, they actually make it worse.Other than that, this sketch seems reasonable ...
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Greg Stark <stark@mit.edu> writes:
And I think there are still unanswered questions about indexes.
One other interesting thought that occurs to me: are we going to support
UPDATEs that cause a row to belong to a different partition? If so, how
are we going to handle the update chain links?
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Tom Lane wrote:
Greg Stark <stark@mit.edu> writes:
And I think there are still unanswered questions about indexes.
One other interesting thought that occurs to me: are we going to support
UPDATEs that cause a row to belong to a different partition? If so, how
are we going to handle the update chain links?
Bah, I didn't mention it? My current thinking is that it would be
disallowed; if you have chosen your partitioning key well enough it
shouldn't be necessary. As a workaround you can always DELETE/INSERT.
Maybe we can allow it later, but for a first cut this seems more than
good enough.
--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Tom Lane wrote:
One other interesting thought that occurs to me: are we going to support
UPDATEs that cause a row to belong to a different partition? If so, how
are we going to handle the update chain links?
Bah, I didn't mention it? My current thinking is that it would be
disallowed; if you have chosen your partitioning key well enough it
shouldn't be necessary. As a workaround you can always DELETE/INSERT.
Maybe we can allow it later, but for a first cut this seems more than
good enough.
Hm. I certainly agree that it's a case that could be disallowed for a
first cut, but it'd be nice to have some clue about how we might allow it
eventually.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2014-08-29 13:15:16 -0400, Tom Lane wrote:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Tom Lane wrote:
One other interesting thought that occurs to me: are we going to support
UPDATEs that cause a row to belong to a different partition? If so, how
are we going to handle the update chain links?Bah, I didn't mention it? My current thinking is that it would be
disallowed; if you have chosen your partitioning key well enough it
shouldn't be necessary. As a workaround you can always DELETE/INSERT.
Maybe we can allow it later, but for a first cut this seems more than
good enough.Hm. I certainly agree that it's a case that could be disallowed for a
first cut, but it'd be nice to have some clue about how we might allow it
eventually.
Not pretty, but we could set t_ctid to some 'magic' value when switching
partitions. Everything chasing ctid chains could then error out when
hitting a invisible row with such a t_ctid. The usecases for doing such
updates really are more maintenance style commands, so it's possibly not
too bad from a usability POV :(
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Andres Freund <andres@2ndquadrant.com> writes:
On 2014-08-29 13:15:16 -0400, Tom Lane wrote:
Hm. I certainly agree that it's a case that could be disallowed for a
first cut, but it'd be nice to have some clue about how we might allow it
eventually.
Not pretty, but we could set t_ctid to some 'magic' value when switching
partitions. Everything chasing ctid chains could then error out when
hitting a invisible row with such a t_ctid.
An actual fix would presumably involve adding a partition number to the
ctid chain field in tuples in partitioned tables. The reason I bring it
up now is that we'd have to commit to doing that (or at least leaving room
for it) in the first implementation, if we don't want to have an on-disk
compatibility break.
There is certainly room to argue that the value of this capability isn't
worth the disk space this solution would eat. But we should have that
argument while the option is still feasible ...
The usecases for doing such
updates really are more maintenance style commands, so it's possibly not
too bad from a usability POV :(
I'm afraid that might just be wishful thinking.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Tom Lane wrote:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Tom Lane wrote:
One other interesting thought that occurs to me: are we going to support
UPDATEs that cause a row to belong to a different partition? If so, how
are we going to handle the update chain links?Bah, I didn't mention it? My current thinking is that it would be
disallowed; if you have chosen your partitioning key well enough it
shouldn't be necessary. As a workaround you can always DELETE/INSERT.
Maybe we can allow it later, but for a first cut this seems more than
good enough.Hm. I certainly agree that it's a case that could be disallowed for a
first cut, but it'd be nice to have some clue about how we might allow it
eventually.
I hesitate to suggest this, but we have free flag bits in
MultiXactStatus. We could use a specially marked multixact member to
indicate the OID of the target relation; perhaps set an infomask bit to
indicate that this has happened. Of course, no HOT updates are possible
so I think it's okay from a heap_prune_chain perspective. This abuses
the knowledge that OIDs and XIDs are both 32 bits long.
Since nowhere else we have the space necessary to store the longer data
that a cross-partition update would require, I don't see anything else
ATM. (For a moment I thought about abusing combo CIDs, but that doesn't
work because this requires to be persistent and visible from other
backends, neither of which is a quality of combocids.)
--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2014-08-29 13:29:19 -0400, Tom Lane wrote:
Andres Freund <andres@2ndquadrant.com> writes:
On 2014-08-29 13:15:16 -0400, Tom Lane wrote:
Hm. I certainly agree that it's a case that could be disallowed for a
first cut, but it'd be nice to have some clue about how we might allow it
eventually.Not pretty, but we could set t_ctid to some 'magic' value when switching
partitions. Everything chasing ctid chains could then error out when
hitting a invisible row with such a t_ctid.An actual fix would presumably involve adding a partition number to the
ctid chain field in tuples in partitioned tables. The reason I bring it
up now is that we'd have to commit to doing that (or at least leaving room
for it) in the first implementation, if we don't want to have an on-disk
compatibility break.
Right. Just adding it unconditionally doesn't sound feasible to me. Our
per-row overhead is already too large. And it doesn't sound fun to have
the first-class partitions use a different heap tuple format than plain
relations.
What we could do is to add some sort of 'jump' tuple when moving a tuple
from one relation to another. So, when updating a tuple between
partitions we add another in the old partition with xmin_jump =
xmax_jump = xmax_old and have the jump tuple's content point to the new
relation.
Far from pretty, but it'd only matter overhead wise when used.
The usecases for doing such
updates really are more maintenance style commands, so it's possibly not
too bad from a usability POV :(I'm afraid that might just be wishful thinking.
I admit that you might very well be right there :(
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Andres Freund <andres@2ndquadrant.com> writes:
On 2014-08-29 13:29:19 -0400, Tom Lane wrote:
An actual fix would presumably involve adding a partition number to the
ctid chain field in tuples in partitioned tables. The reason I bring it
up now is that we'd have to commit to doing that (or at least leaving room
for it) in the first implementation, if we don't want to have an on-disk
compatibility break.
What we could do is to add some sort of 'jump' tuple when moving a tuple
from one relation to another. So, when updating a tuple between
partitions we add another in the old partition with xmin_jump =
xmax_jump = xmax_old and have the jump tuple's content point to the new
relation.
Hm, that might work. It sounds more feasible than Alvaro's suggestion
of abusing cmax --- I don't think that field is free for use in this
context.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 08/29/2014 07:15 PM, Tom Lane wrote:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Tom Lane wrote:
One other interesting thought that occurs to me: are we going to support
UPDATEs that cause a row to belong to a different partition? If so, how
are we going to handle the update chain links?Bah, I didn't mention it? My current thinking is that it would be
disallowed; if you have chosen your partitioning key well enough it
shouldn't be necessary. As a workaround you can always DELETE/INSERT.
Maybe we can allow it later, but for a first cut this seems more than
good enough.Hm. I certainly agree that it's a case that could be disallowed for a
first cut, but it'd be nice to have some clue about how we might allow it
eventually.
There needs to be some structure that is specific to partitions and not
multiple plain tables which would then be used for both update chains and
cross-partition indexes (as you seem to imply by jumping from indexes
to update chains a few posts back).
It would need to replace plain tid (pagenr, tupnr) with triple of (partid,
pagenr, tupnr).
Cross-partition indexes are especially needed if we want to allow putting
UNIQUE constraints on non-partition-key columns.
Cheers
--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic O�
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 08/29/2014 05:56 PM, Alvaro Herrera wrote:
Prompted by a comment in the UPDATE/LIMIT thread, I saw Marko Tiikkaja
reference Tom's post
/messages/by-id/1598.1399826841@sss.pgh.pa.us
which mentions the possibility of a different partitioning
implementation than what we have so far. As it turns out, I've been
thinking about partitioning recently, so I thought I would share what
I'm thinking so that others can poke holes. My intention is to try to
implement this as soon as possible.Declarative partitioning
========================
...
Still To Be Designed
--------------------
* Dependency issues
* Are indexes/constraints inherited from the parent rel?
I'd say mostly yes.
There could some extra "constraint exclusion type" magic for
conditional indexes, but the rest probably should come from "main table"
And there should be some kind of cross-partition indexes. At
"partitioning" capability, this can probably wait for version 2.
* Multiple keys?
Why not. But probably just for hash partitioning.
Subpartitioning?
Probably not. If you need speed for huge numbers of partitions, use
Gregs idea of keeping the partitions in a tree (or just having a
partition index).
Hash partitioning?
At some point definitely.
Also one thing you left unmentioned is dropping (and perhaps also
truncating)
a partition. We still may want to do historic data management the same way
we do it now, by just getting rid of the whole partition or its data.
At some point we may also want to do redistributing data between
partitions,
maybe for case where we end up with 90% of the data in on partition due to
bad partitioning key or partitioning function choice. This is again
something
that is hard now and can therefore be left to a later version.
Open Questions
--------------* What's the syntax to refer to specific partitions within a partitioned
table?
We could do "TABLE <xyz> PARTITION <n>", but for example if in
the future we add hash partitioning, we might need some non-integer
addressing (OTOH assigning sequential numbers to hash partitions doesn't
seem so bad). Discussing with users of other DBMSs partitioning feature,
one useful phrase is "TABLE <xyz> PARTITION FOR <value>".
Or more generally
TABLE <xyz> PARTITION FOR/WHERE col1=val1, col2=val2, ...;
Cheers
--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic O�
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hannu Krosing wrote:
Cross-partition indexes are especially needed if we want to allow putting
UNIQUE constraints on non-partition-key columns.
I'm not going to implement cross-partition indexes in the first patch.
They are a huge can of worms.
--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Aug 29, 2014 at 11:56 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
In this design, partitions are first-class objects, not normal tables in
inheritance hierarchies. There are no pg_inherits entries involved at all.
Whoa. I always assumed that table inheritance was a stepping-stone to
real partitioning, and that real partitioning would be built on top of
table inheritance. In particular, I assume that (as Itagaki
Takahiro's patch did those many years ago) we'd add some metadata
somewhere to allow fast tuple routing (for both pruning and
inserts/updates). What's the benefit of inventing something new
instead?
I'm skeptical about your claim that there will be no pg_inherits
entries involved at all. You need some way to know which partitions
go with which parent table. You can store that many-to-one mapping
someplace other than pg_inherits, but it seems to me that that doesn't
buy you anything; they're just pg_inherits entries under some other
name. Why reinvent that?
Each partition is assigned an Expression that receives a tuple and
returns boolean. This expression returns true if a given tuple belongs
into it, false otherwise. If a tuple for a partitioned relation is run
through expressions of all partitions, exactly one should return true.
If none returns true, it might be because the partition has not been
created yet. A user-facing error is raised in this case (Rationale: if
user creates a partitioned rel and there is no partition that accepts
some given tuple, it's the user's fault.)Additionally, each partitioned relation may have a master expression.
This receives a tuple and returns an integer, which corresponds to the
number of the partition it belongs into.
I agree with Tom: this is a bad design. In particular, if we want to
scale to large numbers of partitions (a principal weakness of the
present system) we need the operation of routing a tuple to a
partition to be as efficient as possible. Range partitioning can be
O(lg n) where n is the number of partitions: store a list of the
boundaries and binary-search it. List partitioning can be O(lg k)
where k is the number of values (which may be more than the number of
partitions) via a similar technique. Hash partitioning can be O(1).
I'm not sure what other kind of partitioning anybody would want to do,
but it's likely that they *won't* want it to be O(1) in the number of
partitions. So I'd say have *only* the master expression.
But, really, I don't think an expression is the right way to store
this; evaluating that repeatedly will, I think, still be too slow.
Think about what happens in PL/pgsql: minimizing the number of times
that you enter and exit the executor helps performance enormously,
even if the expressions are simple enough not to need planning. I
think the representation should be more like an array of partition
boundaries and the pg_proc OID of a comparator.
Per-partition expressions are formed as each partition is created, and
are based on the user-supplied partitioning criterion. Master
expressions are formed at relation creation time. (XXX Can we change
the master expression later, as a result of some ALTER command?
Presumably this would mean that all partitions might need to be
rewritten.)
This is another really important point. If you store an opaque
expression mapping partitioning keys to partition numbers, you can't
do things like this efficiently. With a more transparent
representation, like a sorted array of partition boundaries for range
partitioning, or a sorted array of hash values for consistent hashing,
you can do things like split and merge partitions efficiently,
minimizing rewriting.
Planner -------
A partitioned relation behaves just like a regular relation for purposes
of planner. XXX do we need special considerations regarding relation
size estimation?For scan plans, we need to prepare Append lists which are used to scan
for tuples in a partitioned relation. We can setup fake constraint
expressions based on the partitioning expressions, which let the planner
discard unnecessary partitions by way of constraint exclusion.
So if we're going to do all this, why bother making the partitions
anything other than inheritance children? There might be some benefit
in having the partitions be some kind of stripped-down object if we
could avoid some of these planner gymnastics and get, e.g. efficient
run-time partition pruning. But if you're going to generate Append
plans and switch ResultRelInfos and stuff just as you would for an
inheritance hierarchy, why not just make it an inheritance hierarchy?
It seems pretty clear to me that we need partitioned tables to have
the same tuple descriptor throughout the relation, for efficient tuple
routing and so on. But the other restrictions you're proposing to
impose on partitions have no obvious value that I can see. We could
have a rule that when you inherit from a partition root, you can only
inherit from that one table (no multiple inheritance) and your tuple
descriptor must match precisely (down to dropped columns and column
ordering) and that would give you everything I think you really need
here. There's no gain to be had in forbidding partitions from having
different owners, or being selected from directly, or having
user-visible names. The first of those is arguably useless, but it's
not really causing us any problems, and the latter two are extremely
useful features. Unless you are going to implement partition pruning
is so good that it will never fail to realize a situation where only
one partition needs to be scanned, letting users target the partition
directly is a very important escape hatch.
(In the future we might be interested in creating specialized plan and
execution nodes that know more about partitioned relations, to avoid
creating useless Append trees only to prune them later.)
Good idea.
pg_dump is able to dump a partitioned relation as a CREATE
TABLE/PARTITION command and a series of ALTER TABLE/CREATE PARTITION
commands. The data of all partitions is considered a single COPY
operation.XXX this limits the ability to restore in parallel. To fix we might consider
using one COPY for each partition. It's not clear what relation should be
mentioned in such a COPY command, though -- my instinct is that it
should reference the parent table only, not the individual partition.
Targeting the individual partitions seems considerably better.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sat, Aug 30, 2014 at 12:56 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Prompted by a comment in the UPDATE/LIMIT thread, I saw Marko Tiikkaja
reference Tom's post
/messages/by-id/1598.1399826841@sss.pgh.pa.us
which mentions the possibility of a different partitioning
implementation than what we have so far. As it turns out, I've been
thinking about partitioning recently, so I thought I would share what
I'm thinking so that others can poke holes. My intention is to try to
implement this as soon as possible.
+1.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Another thought about this general topic:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
...
Allowed actions on a RELKIND_PARTITION:
* CREATE INDEX .. ON PARTITION <n> ON TABLE <xyz>
...
Still To Be Designed
--------------------
* Are indexes/constraints inherited from the parent rel?
I think one of the key design decisions we have to make is whether
partitions are all constrained to have exactly the same set of indexes.
If we don't insist on that it will greatly complicate planning compared
to what we'll get if we do insist on it, because then the planner will
need to generate a separate customized plan subtree for each partition.
Aside from costing planning time, most likely that would forever prevent
us from pushing some types of intelligence about partitioning into the
executor.
Now, in the current model, it's up to the user what indexes to create
on each partition, and sometimes one might feel that maintaining a
particular index is unnecessary in some partitions. But the flip side
of that is it's awfully easy to screw yourself by forgetting to add
some index when you add a new partition. So I'm not real sure which
approach is superior from a purely user-oriented perspective.
I'm not trying to push one or the other answer right now, just noting
that this is a critical decision.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 08/31/2014 10:03 PM, Tom Lane wrote:
Another thought about this general topic:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
...
Allowed actions on a RELKIND_PARTITION:
* CREATE INDEX .. ON PARTITION <n> ON TABLE <xyz>
...
Still To Be Designed
--------------------
* Are indexes/constraints inherited from the parent rel?I think one of the key design decisions we have to make is whether
partitions are all constrained to have exactly the same set of indexes.
If we don't insist on that it will greatly complicate planning compared
to what we'll get if we do insist on it, because then the planner will
need to generate a separate customized plan subtree for each partition.
Aside from costing planning time, most likely that would forever prevent
us from pushing some types of intelligence about partitioning into the
executor.Now, in the current model, it's up to the user what indexes to create
on each partition, and sometimes one might feel that maintaining a
particular index is unnecessary in some partitions. But the flip side
of that is it's awfully easy to screw yourself by forgetting to add
some index when you add a new partition.
The "forgetting" part is easy to solve by inheriting all indexes from
parent (or template) partition unless explicitly told not to.
One other thing that has been bothering me about this proposal
is the ability to take partitions offline for maintenance or to load
them offline ant then switch in.
In current scheme we do this using ALTER TABLE ... [NO] INHERIT ...
If we also want to have this with the not-directly-accessible partitions
then perhaps it could be done by having a possibility to move
a partition between two tables with exactly the same structure ?
So I'm not real sure which
approach is superior from a purely user-oriented perspective.
What we currently have is a very flexible scheme which has a few
drawbacks
1) unnecessarily complex for simple case
2) easy to shoot yourself in the foot by forgetting something
3) can be hard on planner, especially with huge number of partitions
An alternative way of solving these problems is adding some
(meta-)constraints to current way of doing things and some more
automation
CREATE TABLE FOR PARTITIONMASTER
WITH (ALL_INDEXES_SAME=ON,
SAME_STRUCTURE_ALWAYS=ON,
SINGLE_INHERITANCE_ONLY=ON,
NESTED_INHERITS=OFF,
PARTITION_FUNCTION=default_range_partitioning(int)
);
and then force these when adding inherited tables (in this case
partition tables)
either via CREATE TABLE or ALTER TABLE
Best Regards
--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic O�
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Aug 29, 2014 at 12:35:50PM -0400, Tom Lane wrote:
Each partition is assigned an Expression that receives a tuple and
returns boolean. This expression returns true if a given tuple belongs
into it, false otherwise.-1, in fact minus a lot. One of the core problems of the current approach
is that the system, particularly the planner, hasn't got a lot of insight
into exactly what the partitioning scheme is in a partitioned table built
on inheritance. If you allow the partitioning rule to be a black box then
that doesn't get any better. I want to see a design wherein the system
understands *exactly* what the partitioning behavior is. I'd start with
supporting range-based partitioning explicitly, and maybe we could add
other behaviors such as hashing later.In particular, there should never be any question at all that there is
exactly one partition that a given row belongs to, not more, not less.
You can't achieve that with a set of independent filter expressions;
a meta-rule that says "exactly one of them should return true" is an
untrustworthy band-aid.(This does not preclude us from mapping the tuple through the partitioning
rule and finding that the corresponding partition doesn't currently exist.
I think we could view the partitioning rule as a function from tuples to
partition numbers, and then we look in pg_class to see if such a partition
exists.)
There is one situation where you need to be more flexible, and that is
if you ever want to support online repartitioning. To do that you have
to distinguish between "I want to insert tuple X, which partition
should it go into" and "I want to know which partitions I need to look
for partition_key=Y".
For the latter you really have need an expression per partition, or
something equivalent. If performance is an issue I suppose you could
live with having an "old" and an "new" partition scheme, so you
couldn't have two "live repartitionings" happening simultaneously.
Now, if you want to close the door on online repartitioning forever
then that fine. But being in the position of having to say "yes our
partitioning scheme sucks, but we would have to take the database down
for a week to fix it" is no fun.
Unless logical replication provides a way out maybe??
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.
-- Arthur Schopenhauer