UPDATE of partition key
Currently, an update of the partition key of a partition is not allowed,
since it requires moving the row(s) into the applicable partition.
Attached is a WIP patch (update-partition-key.patch) that removes this
restriction. When an UPDATE causes the row of a partition to violate
its partition constraint, then a partition is searched in that subtree
that can accommodate this row, and if found, the row is deleted from
the old partition and inserted in the new partition. If not found, an
error is reported.
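As a minimal sketch of the new behaviour (hypothetical table names; assumes the WIP patch is applied):

```sql
-- Hypothetical schema; with the patch, the UPDATE below moves the row
-- from part_a to part_b instead of raising a partition-constraint error.
CREATE TABLE tab (k int, v text) PARTITION BY RANGE (k);
CREATE TABLE part_a PARTITION OF tab FOR VALUES FROM (1) TO (100);
CREATE TABLE part_b PARTITION OF tab FOR VALUES FROM (100) TO (200);

INSERT INTO tab VALUES (50, 'row1');   -- routed to part_a
UPDATE tab SET k = 150 WHERE k = 50;   -- violates part_a's constraint:
                                       -- deleted from part_a, inserted
                                       -- into part_b
```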
There are a few things that can be discussed here:
1. We can run an UPDATE using a child partition at any level in a
nested partition tree. In such a case, we should move the row only
within that child subtree.
For example, in a tree such as:
tab ->
t1 ->
t1_1
t1_2
t2 ->
t2_1
t2_2
For "UPDATE t2 SET col1 = 'AAA'", if the modified tuple does not fit
in t2_1 but can fit in t1_1, it should not be moved to t1_1, because
the UPDATE was fired on t2.
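In DDL terms, the tree above might look like this (hypothetical key columns and ranges):

```sql
-- Hypothetical two-level tree matching the example above.
CREATE TABLE tab (a int, b int) PARTITION BY RANGE (a);
CREATE TABLE t1 PARTITION OF tab FOR VALUES FROM (1) TO (100)
    PARTITION BY RANGE (b);
CREATE TABLE t1_1 PARTITION OF t1 FOR VALUES FROM (1) TO (50);
CREATE TABLE t1_2 PARTITION OF t1 FOR VALUES FROM (50) TO (100);
CREATE TABLE t2 PARTITION OF tab FOR VALUES FROM (100) TO (200)
    PARTITION BY RANGE (b);
CREATE TABLE t2_1 PARTITION OF t2 FOR VALUES FROM (1) TO (50);
CREATE TABLE t2_2 PARTITION OF t2 FOR VALUES FROM (50) TO (100);

-- Fired on t2, so routing is confined to t2's subtree: the new value
-- of a would fit under t1, but the UPDATE should error out instead.
UPDATE t2 SET a = 10;
```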
2. In the patch, as part of the row movement, ExecDelete() is called
followed by ExecInsert(). This is done that way because we want the
ROW triggers on that (sub)partition to be executed. If a user has
explicitly created DELETE and INSERT BR triggers for this partition, I
think we should run those. At the same time, another question is:
what about the UPDATE trigger on the same table? Here again, one can
argue that because this UPDATE has been transformed into a
DELETE-INSERT, we should not run the UPDATE trigger for row movement.
But there can be a counter-argument. For example, if a user needs to
make sure that updates of particular columns of a row are logged, he
will expect the logging to happen even when that row was transparently
moved. In the patch, I have retained the firing of the UPDATE BR trigger.
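For concreteness, these are the kinds of triggers in question (hypothetical names and trigger function; exactly which of them should fire for a row-moving UPDATE is the open question above):

```sql
-- Hypothetical BR triggers on the source and destination partitions.
-- Per the patch as described: the BR UPDATE trigger on the source
-- partition is retained, and the DELETE/INSERT BR triggers fire as
-- part of the row movement.
CREATE TRIGGER t2_1_br_upd BEFORE UPDATE ON t2_1
    FOR EACH ROW EXECUTE PROCEDURE log_change();
CREATE TRIGGER t2_1_br_del BEFORE DELETE ON t2_1
    FOR EACH ROW EXECUTE PROCEDURE log_change();
CREATE TRIGGER t2_2_br_ins BEFORE INSERT ON t2_2
    FOR EACH ROW EXECUTE PROCEDURE log_change();
```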
3. In case of a concurrent update/delete, suppose session A has locked
the row for deleting it. Now session B has decided to update this
row, and that is going to cause row movement, which means it will
delete it first. But when session A has finished deleting it, session B
finds that it is already deleted. In such a case, it should not go
ahead with inserting a new row as part of the row movement. For that, I
have added a new parameter 'already_delete' for ExecDelete().
Of course, this still won't completely solve the concurrency anomaly.
In the above case, the UPDATE of session B gets lost. Maybe, for a
user who cannot tolerate this, we can have a table-level option
that disallows row movement, or causes an error to be thrown for
one of the concurrent sessions.
4. ExecSetupPartitionTupleRouting() is re-used for routing the row
that is to be moved. So in ExecInitModifyTable(), we call
ExecSetupPartitionTupleRouting() even for UPDATE. We could also do
this only at execution time, the very first time we find that we
need to do a row movement. I will think over that, but I suspect
it might complicate things compared to always doing the setup for
UPDATE. Will check on that.
5. Regarding performance testing, I have compared the results of
row movement with declarative partitions versus row movement with an
inheritance tree using triggers. Below are the details:
Schema :
CREATE TABLE ptab (a date, b int, c int);
CREATE TABLE ptab (a date, b int, c int) PARTITION BY RANGE (a, b);
CREATE TABLE ptab_1_1 PARTITION OF ptab
for values from ('1900-01-01', 1) to ('1900-01-01', 101)
PARTITION BY range (c);
CREATE TABLE ptab_1_1_1 PARTITION OF ptab_1_1
for values from (1) to (51);
CREATE TABLE ptab_1_1_2 PARTITION OF ptab_1_1
for values from (51) to (101);
.....
.....
CREATE TABLE ptab_1_1_n PARTITION OF ptab_1_1
for values from (n) to (n+m);
......
......
CREATE TABLE ptab_5_n PARTITION OF ptab
for values from ('1905-01-01', 101) to ('1905-01-01', 201)
PARTITION BY range (c);
CREATE TABLE ptab_1_2_1 PARTITION OF ptab_1_2
for values from (1) to (51);
CREATE TABLE ptab_1_2_2 PARTITION OF ptab_1_2
for values from (51) to (101);
.....
.....
CREATE TABLE ptab_1_2_n PARTITION OF ptab_1_2
for values from (n) to (n+m);
.....
.....
Similarly for inheritance :
CREATE TABLE ptab_1_1
(constraint check_ptab_1_1 check (a = '1900-01-01' and b >= 1 and b <
8)) inherits (ptab);
create trigger brutrig_ptab_1_1 before update on ptab_1_1 for each row
execute procedure ptab_upd_trig();
CREATE TABLE ptab_1_1_1
(constraint check_ptab_1_1_1 check (c >= 1 and c < 51))
inherits (ptab_1_1);
create trigger brutrig_ptab_1_1_1 before update on ptab_1_1_1 for each
row execute procedure ptab_upd_trig();
CREATE TABLE ptab_1_1_2
(constraint check_ptab_1_1_2 check (c >= 51 and c < 101))
inherits (ptab_1_1);
create trigger brutrig_ptab_1_1_2 before update on ptab_1_1_2 for each
row execute procedure ptab_upd_trig();
I had to have a BR UPDATE trigger on each of the leaf tables.
Attached is the BR trigger function update_trigger.sql. It generates
the destination table name assuming a fixed pattern of distribution of
data over the partitions. It first deletes the row and then inserts a
new one. I also tried skipping the deletion part, and it did not show
any significant change in the results.
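The attachment itself is not reproduced here, but a trigger function in that spirit might look like the following (a hypothetical reconstruction; the destination-name computation assumes a made-up distribution pattern, not the actual script):

```sql
-- Hypothetical sketch of the inheritance-based BR UPDATE trigger:
-- derive the destination leaf table from the NEW key values, delete
-- the old row, insert the new one, and cancel the original UPDATE.
CREATE OR REPLACE FUNCTION ptab_upd_trig() RETURNS trigger AS $$
DECLARE
    dest text;
BEGIN
    -- Assumed naming pattern (illustrative only).
    dest := format('ptab_%s_%s_%s',
                   extract(year FROM NEW.a)::int - 1899,
                   (NEW.b - 1) / 100 + 1,
                   (NEW.c - 1) / 50 + 1);
    EXECUTE format('DELETE FROM %I WHERE a = $1 AND b = $2 AND c = $3',
                   TG_TABLE_NAME)
        USING OLD.a, OLD.b, OLD.c;
    EXECUTE format('INSERT INTO %I VALUES ($1, $2, $3)', dest)
        USING NEW.a, NEW.b, NEW.c;
    RETURN NULL;  -- suppress the original UPDATE
END;
$$ LANGUAGE plpgsql;
```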
parts  partitioned   inheritance    no. of rows  subpartitions
=====  ===========   ===========    ===========  =============
  500  10 sec        3 min 02 sec    1,000,000   0
 1000  10 sec        3 min 05 sec    1,000,000   0
 1000  1 min 38 sec  30 min 50 sec  10,000,000   0
 4000  28 sec        5 min 41 sec    1,000,000   10
parts : total number of partitions, including subpartitions if any.
partitioned : partitions created using declarative syntax.
inheritance : partitions created using inheritance, check constraints
and insert/update triggers.
subpartitions : number of subpartitions for each partition (in a 2-level tree)
Overall the UPDATE in partitions is faster by 10-20 times compared
with inheritance with triggers.
The UPDATE query moved all of the rows into another partition. It was
something like this :
update ptab set a = '1949-01-1' where a <= '1924-01-01'
For a plain table with 1,000,000 rows, the UPDATE took 8 seconds, and
with 10,000,000 rows, it took 1min 32sec.
In general, for both partitioned and inheritance tables, the time
taken rose linearly with the number of rows.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On Mon, Feb 13, 2017 at 7:01 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
parts partitioned inheritance no. of rows subpartitions
===== =========== =========== =========== =============
500 10 sec 3 min 02 sec 1,000,000 0
1000 10 sec 3 min 05 sec 1,000,000 0
1000 1 min 38sec 30min 50 sec 10,000,000 0
4000 28 sec 5 min 41 sec 1,000,000 10
That's a big speedup.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Feb 13, 2017 at 05:31:56PM +0530, Amit Khandekar wrote:
Currently, an update of a partition key of a partition is not
allowed, since it requires to move the row(s) into the applicable
partition.
Attached is a WIP patch (update-partition-key.patch) that removes
this restriction. When an UPDATE causes the row of a partition to
violate its partition constraint, then a partition is searched in
that subtree that can accommodate this row, and if found, the row is
deleted from the old partition and inserted in the new partition. If
not found, an error is reported.
This is great!
Would it be really invasive to HINT something when the subtree is a
proper subtree?
Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On 14 February 2017 at 22:24, David Fetter <david@fetter.org> wrote:
On Mon, Feb 13, 2017 at 05:31:56PM +0530, Amit Khandekar wrote:
Currently, an update of a partition key of a partition is not
allowed, since it requires to move the row(s) into the applicable
partition.
Attached is a WIP patch (update-partition-key.patch) that removes
this restriction. When an UPDATE causes the row of a partition to
violate its partition constraint, then a partition is searched in
that subtree that can accommodate this row, and if found, the row is
deleted from the old partition and inserted in the new partition. If
not found, an error is reported.
This is great!
Would it be really invasive to HINT something when the subtree is a
proper subtree?
I am not quite sure I understood this question. Can you please explain
it a bit more ...
On Wed, Feb 15, 2017 at 01:06:32PM +0530, Amit Khandekar wrote:
On 14 February 2017 at 22:24, David Fetter <david@fetter.org> wrote:
On Mon, Feb 13, 2017 at 05:31:56PM +0530, Amit Khandekar wrote:
Currently, an update of a partition key of a partition is not
allowed, since it requires to move the row(s) into the applicable
partition.
Attached is a WIP patch (update-partition-key.patch) that removes
this restriction. When an UPDATE causes the row of a partition to
violate its partition constraint, then a partition is searched in
that subtree that can accommodate this row, and if found, the row
is deleted from the old partition and inserted in the new
partition. If not found, an error is reported.
This is great!
Would it be really invasive to HINT something when the subtree is
a proper subtree?
I am not quite sure I understood this question. Can you please
explain it a bit more ...
Sorry. When an UPDATE can't happen, there are often ways to hint at
what went wrong and how to correct it. Violating a uniqueness
constraint would be one example.
When an UPDATE can't happen and the depth of the subtree is a
plausible candidate for what prevents it, there might be a way to say
so.
Let's imagine a table called log with partitions on "stamp" log_YYYY
and subpartitions, also on "stamp", log_YYYYMM. If you do something
like
UPDATE log_2017 SET "stamp"='2016-11-08 23:03:00' WHERE ...
it's possible to know that it might have worked had the UPDATE taken
place on log rather than on log_2017.
Does that make sense, and if so, is it super invasive to HINT that?
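A sketch of that scenario (hypothetical DDL; assumes a log_2016 subtree also exists):

```sql
-- Hypothetical schema for the log example above.
CREATE TABLE log (stamp timestamptz, msg text) PARTITION BY RANGE (stamp);
CREATE TABLE log_2017 PARTITION OF log
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01')
    PARTITION BY RANGE (stamp);
CREATE TABLE log_201701 PARTITION OF log_2017
    FOR VALUES FROM ('2017-01-01') TO ('2017-02-01');

-- Fails: the new stamp falls outside log_2017's own subtree ...
UPDATE log_2017 SET stamp = '2016-11-08 23:03:00' WHERE msg = 'x';
-- ... but the same UPDATE on log could have routed the row into the
-- (assumed) log_2016 subtree -- hence the proposed HINT.
```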
Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On 15 February 2017 at 20:26, David Fetter <david@fetter.org> wrote:
When an UPDATE can't happen, there are often ways to hint at
what went wrong and how to correct it. Violating a uniqueness
constraint would be one example.
When an UPDATE can't happen and the depth of the subtree is a
plausible candidate for what prevents it, there might be a way to say
so.
Let's imagine a table called log with partitions on "stamp" log_YYYY
and subpartitions, also on "stamp", log_YYYYMM. If you do something
like
UPDATE log_2017 SET "stamp"='2016-11-08 23:03:00' WHERE ...
it's possible to know that it might have worked had the UPDATE taken
place on log rather than on log_2017.
Does that make sense, and if so, is it super invasive to HINT that?
Yeah, I think it should be possible to find the root partition with
the help of pg_partitioned_table, and then run ExecFindPartition()
again using the root. Will check. I am not sure right now how involved
that would turn out to be, but I think that logic would not change the
existing code, so in that sense it is not invasive.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On 2017/02/16 15:50, Amit Khandekar wrote:
On 15 February 2017 at 20:26, David Fetter <david@fetter.org> wrote:
When an UPDATE can't happen, there are often ways to hint at
what went wrong and how to correct it. Violating a uniqueness
constraint would be one example.
When an UPDATE can't happen and the depth of the subtree is a
plausible candidate for what prevents it, there might be a way to say
so.
Let's imagine a table called log with partitions on "stamp" log_YYYY
and subpartitions, also on "stamp", log_YYYYMM. If you do something
like
UPDATE log_2017 SET "stamp"='2016-11-08 23:03:00' WHERE ...
it's possible to know that it might have worked had the UPDATE taken
place on log rather than on log_2017.
Does that make sense, and if so, is it super invasive to HINT that?
Yeah, I think it should be possible to find the root partition with
I assume you mean root *partitioned* table.
the help of pg_partitioned_table,
The pg_partitioned_table catalog does not store parent-child
relationships, just information about the partition key of a table. To
get the root partitioned table, you might want to create a recursive
version of get_partition_parent(), maybe called
get_partition_root_parent(). By the way, get_partition_parent() scans
pg_inherits to find the inheritance parent.
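For illustration, the parent chain that such a get_partition_root_parent() would have to walk can already be seen from SQL with a recursive query over pg_inherits (a sketch of the lookup only, not the proposed C implementation; 'ptab_1_1_1' is the leaf table from the earlier example):

```sql
-- Walk pg_inherits upward from a leaf partition; the last row returned
-- is the root partitioned table.
WITH RECURSIVE ancestry AS (
    SELECT inhrelid, inhparent
    FROM pg_inherits
    WHERE inhrelid = 'ptab_1_1_1'::regclass
    UNION ALL
    SELECT i.inhrelid, i.inhparent
    FROM pg_inherits i
    JOIN ancestry a ON i.inhrelid = a.inhparent
)
SELECT inhparent::regclass AS ancestor FROM ancestry;
```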
and then run ExecFindPartition()
again using the root. Will check. I am not sure right now how involved
that would turn out to be, but I think that logic would not change the
existing code, so in that sense it is not invasive.
I couldn't understand why run ExecFindPartition() again on the root
partitioned table, can you clarify? ISTM, we just want to tell the user
in the HINT that trying the same update query with root partitioned table
might work. I'm not sure if it would work instead to find some
intermediate partitioned table (that is, between the root and the one that
update query was tried with) to include in the HINT.
Thanks,
Amit
On 16 February 2017 at 12:57, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2017/02/16 15:50, Amit Khandekar wrote:
On 15 February 2017 at 20:26, David Fetter <david@fetter.org> wrote:
When an UPDATE can't happen, there are often ways to hint at
what went wrong and how to correct it. Violating a uniqueness
constraint would be one example.
When an UPDATE can't happen and the depth of the subtree is a
plausible candidate for what prevents it, there might be a way to say
so.
Let's imagine a table called log with partitions on "stamp" log_YYYY
and subpartitions, also on "stamp", log_YYYYMM. If you do something
like
UPDATE log_2017 SET "stamp"='2016-11-08 23:03:00' WHERE ...
it's possible to know that it might have worked had the UPDATE taken
place on log rather than on log_2017.
Does that make sense, and if so, is it super invasive to HINT that?
Yeah, I think it should be possible to find the root partition with
I assume you mean root *partitioned* table.
the help of pg_partitioned_table,
The pg_partitioned_table catalog does not store parent-child
relationships, just information about the partition key of a table. To
get the root partitioned table, you might want to create a recursive
version of get_partition_parent(), maybe called
get_partition_root_parent(). By the way, get_partition_parent() scans
pg_inherits to find the inheritance parent.
Yeah. But we also want to make sure that it's part of a declarative
partition tree, and not just an inheritance tree? I am not sure
whether it is currently possible to have a mix of these two. Maybe it
is easy to prevent that from happening.
and then run ExecFindPartition()
again using the root. Will check. I am not sure right now how involved
that would turn out to be, but I think that logic would not change the
existing code, so in that sense it is not invasive.
I couldn't understand why run ExecFindPartition() again on the root
partitioned table, can you clarify? ISTM, we just want to tell the user
in the HINT that trying the same update query with root partitioned table
might work. I'm not sure if it would work instead to find some
intermediate partitioned table (that is, between the root and the one that
update query was tried with) to include in the HINT.
What I had in mind was: give that hint only if there *was* a
subpartition that could accommodate that row. And if found, we can
include only that subpartition's name.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On 2017/02/16 17:55, Amit Khandekar wrote:
On 16 February 2017 at 12:57, Amit Langote wrote:
On 2017/02/16 15:50, Amit Khandekar wrote:
On 15 February 2017 at 20:26, David Fetter <david@fetter.org> wrote:
Does that make sense, and if so, is it super invasive to HINT that?
Yeah, I think it should be possible to find the root partition with
I assume you mean root *partitioned* table.
the help of pg_partitioned_table,
The pg_partitioned_table catalog does not store parent-child
relationships, just information about the partition key of a table. To
get the root partitioned table, you might want to create a recursive
version of get_partition_parent(), maybe called
get_partition_root_parent(). By the way, get_partition_parent() scans
pg_inherits to find the inheritance parent.
Yeah. But we also want to make sure that it's a part of declarative
partition tree, and not just an inheritance tree ? I am not sure
whether it is currently possible to have a mix of these two. May be it
is easy to prevent that from happening.
It is not possible to mix declarative partitioning and regular
inheritance. So, you cannot have a table in a declarative partitioning
tree that is not a (sub-) partition of the root table.
and then run ExecFindPartition()
again using the root. Will check. I am not sure right now how involved
that would turn out to be, but I think that logic would not change the
existing code, so in that sense it is not invasive.
I couldn't understand why run ExecFindPartition() again on the root
partitioned table, can you clarify? ISTM, we just want to tell the user
in the HINT that trying the same update query with root partitioned table
might work. I'm not sure if it would work instead to find some
intermediate partitioned table (that is, between the root and the one that
update query was tried with) to include in the HINT.
What I had in mind was : Give that hint only if there *was* a
subpartition that could accommodate that row. And if found, we can
only include the subpartition name.
Asking to try the update query with the root table sounds like a good
enough hint. Trying to find the exact sub-partition (I assume you mean to
imply sub-tree here) seems like an overkill, IMHO.
Thanks,
Amit
On 16 February 2017 at 14:42, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2017/02/16 17:55, Amit Khandekar wrote:
On 16 February 2017 at 12:57, Amit Langote wrote:
On 2017/02/16 15:50, Amit Khandekar wrote:
On 15 February 2017 at 20:26, David Fetter <david@fetter.org> wrote:
Does that make sense, and if so, is it super invasive to HINT that?
Yeah, I think it should be possible to find the root partition with
I assume you mean root *partitioned* table.
the help of pg_partitioned_table,
The pg_partitioned_table catalog does not store parent-child
relationships, just information about the partition key of a table. To
get the root partitioned table, you might want to create a recursive
version of get_partition_parent(), maybe called
get_partition_root_parent(). By the way, get_partition_parent() scans
pg_inherits to find the inheritance parent.
Yeah. But we also want to make sure that it's a part of declarative
partition tree, and not just an inheritance tree ? I am not sure
whether it is currently possible to have a mix of these two. May be it
is easy to prevent that from happening.
It is not possible to mix declarative
inheritance. So, you cannot have a table in a declarative partitioning
tree that is not a (sub-) partition of the root table.
Ok, then that makes things easy.
and then run ExecFindPartition()
again using the root. Will check. I am not sure right now how involved
that would turn out to be, but I think that logic would not change the
existing code, so in that sense it is not invasive.
I couldn't understand why run ExecFindPartition() again on the root
partitioned table, can you clarify? ISTM, we just want to tell the user
in the HINT that trying the same update query with root partitioned table
might work. I'm not sure if it would work instead to find some
intermediate partitioned table (that is, between the root and the one that
update query was tried with) to include in the HINT.What I had in mind was : Give that hint only if there *was* a
subpartition that could accommodate that row. And if found, we can
only include the subpartition name.
Asking to try the update query with the root table sounds like a good
enough hint. Trying to find the exact sub-partition (I assume you mean to
imply sub-tree here) seems like an overkill, IMHO.
Yeah ... I was thinking: anyway, it's an error condition, so why not
let the server spend a bit more CPU and find the right sub-partition
for the message. If we decide to write code to find the root
partition, then it's just a matter of another ExecFindPartition()
call.
Also, I was thinking: give the hint *only* if we know there is a
right sub-partition. Otherwise, it might distract the user.
Thanks,
Amit
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On 13 February 2017 at 12:01, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
There are a few things that can be discussed about :
If you do a normal update the new tuple is linked to the old one using
the ctid forming a chain of tuple versions. This tuple movement breaks
that chain. So the question I had reading this proposal is what
behaviour depends on ctid and how is it affected by the ctid chain
being broken.
I think the concurrent update case is just a symptom of this. If you
try to update a row that's locked for a concurrent update you normally
wait until the concurrent update finishes, then follow the ctid chain
and recheck the where clause on the target of the link and if it still
matches you perform the update there.
At least you do that if you have isolation_level set to
repeatable_read. If you have isolation level set to serializable then
you just fail with a serialization failure. I think that's what you
should do if you come across a row that's been updated with a broken
ctid chain even in repeatable read mode. Just fail with a
serialization failure and document that, in partitioned tables, if you
perform updates that move tuples between partitions then you need to
ensure your updates are prepared for serialization failures.
I think this would require another bit in the tuple infomask
indicating that this tuple is the last version before a broken ctid
chain -- i.e. that it was updated by moving it to another partition.
Maybe there's some combination of bits you could use, though, since
this is only needed in a particular situation.
Offhand I don't know what other behaviours are dependent on the ctid
chain. I think you need to go search the docs -- and probably the code
just to be sure -- for any references to ctid to ensure you catch
every impact of breaking the ctid chain.
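The chain in question can be observed directly on a plain table (a sketch; the ctid values shown are merely typical, not guaranteed):

```sql
-- Within one relfilenode, an UPDATE leaves a forwarding t_ctid in the
-- old tuple version pointing at the new one.  A cross-partition move
-- cannot do this: the old partition sees only a DELETE.
CREATE TABLE single (id int, v text);
INSERT INTO single VALUES (1, 'a');
SELECT ctid FROM single WHERE id = 1;   -- e.g. (0,1)
UPDATE single SET v = 'b' WHERE id = 1;
SELECT ctid FROM single WHERE id = 1;   -- e.g. (0,2): the new version
```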
--
greg
On Thu, Feb 16, 2017 at 03:39:30PM +0530, Amit Khandekar wrote:
and then run ExecFindPartition()
again using the root. Will check. I am not sure right now how involved
that would turn out to be, but I think that logic would not change the
existing code, so in that sense it is not invasive.
I couldn't understand why run ExecFindPartition() again on the root
partitioned table, can you clarify? ISTM, we just want to tell the user
in the HINT that trying the same update query with root partitioned table
might work. I'm not sure if it would work instead to find some
intermediate partitioned table (that is, between the root and the one that
update query was tried with) to include in the HINT.
What I had in mind was : Give that hint only if there *was* a
subpartition that could accommodate that row. And if found, we can
only include the subpartition name.
Asking to try the update query with the root table sounds like a good
enough hint. Trying to find the exact sub-partition (I assume you mean to
imply sub-tree here) seems like an overkill, IMHO.
Yeah ... I was thinking , anyways it's an error condition, so why not
let the server spend a bit more CPU and get the right sub-partition
for the message. If we decide to write code to find the root
partition, then it's just a matter of another function
ExecFindPartition().
Also, I was thinking : give the hint *only* if we know there is a
right sub-partition. Otherwise, it might distract the user.
If this is relatively straight-forward, it'd be great. More
actionable knowledge is better.
Thanks for taking this on.
Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Thu, Feb 16, 2017 at 5:47 AM, Greg Stark <stark@mit.edu> wrote:
On 13 February 2017 at 12:01, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
There are a few things that can be discussed about :
If you do a normal update the new tuple is linked to the old one using
the ctid forming a chain of tuple versions. This tuple movement breaks
that chain. So the question I had reading this proposal is what
behaviour depends on ctid and how is it affected by the ctid chain
being broken.
I think this is a good question.
I think the concurrent update case is just a symptom of this. If you
try to update a row that's locked for a concurrent update you normally
wait until the concurrent update finishes, then follow the ctid chain
and recheck the where clause on the target of the link and if it still
matches you perform the update there.
Right. EvalPlanQual behavior, in short.
At least you do that if you have isolation_level set to
repeatable_read. If you have isolation level set to serializable then
you just fail with a serialization failure. I think that's what you
should do if you come across a row that's been updated with a broken
ctid chain even in repeatable read mode. Just fail with a
serialization failure and document that in partitioned tables if you
perform updates that move tuples between partitions then you need to
ensure your updates are prepared for serialization failures.
Now, this part I'm not sure about. What's pretty clear is that,
barring some redesign of the heap format, we can't keep the CTID chain
intact when the tuple moves to a different relfilenode. What's less
clear is what to do about that. We can either (1) give up on
EvalPlanQual behavior in this case and act just as we would if the row
had been deleted; no update happens or (2) throw a serialization
error. You're advocating for #2, but I'm not sure that's right,
because:
1. It's a lot more work,
2. Your proposed implementation needs an on-disk format change that
uses up a scarce infomask bit, and
3. It's not obvious to me that it's clearly preferable from a user
experience standpoint. I mean, either way the user doesn't get the
behavior that they want. Either they're hoping for EPQ semantics and
they instead do a no-op update, or they're hoping for EPQ semantics
and they instead get an ERROR. Generally speaking, we don't throw
serialization errors today at READ COMMITTED, so if we do so here,
that's going to be a noticeable and perhaps unwelcome change.
More opinions welcome.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 16 February 2017 at 20:53, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Feb 16, 2017 at 5:47 AM, Greg Stark <stark@mit.edu> wrote:
On 13 February 2017 at 12:01, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
There are a few things that can be discussed about :
If you do a normal update the new tuple is linked to the old one using
the ctid forming a chain of tuple versions. This tuple movement breaks
that chain. So the question I had reading this proposal is what
behaviour depends on ctid and how is it affected by the ctid chain
being broken.
I think this is a good question.
I think the concurrent update case is just a symptom of this. If you
try to update a row that's locked for a concurrent update you normally
wait until the concurrent update finishes, then follow the ctid chain
and recheck the where clause on the target of the link and if it still
matches you perform the update there.
Right. EvalPlanQual behavior, in short.
At least you do that if you have isolation_level set to
repeatable_read. If you have isolation level set to serializable then
you just fail with a serialization failure. I think that's what you
should do if you come across a row that's been updated with a broken
ctid chain even in repeatable read mode. Just fail with a
serialization failure and document that in partitioned tables if you
perform updates that move tuples between partitions then you need to
ensure your updates are prepared for serialization failures.
Now, this part I'm not sure about. What's pretty clear is that,
barring some redesign of the heap format, we can't keep the CTID chain
intact when the tuple moves to a different relfilenode. What's less
clear is what to do about that. We can either (1) give up on
EvalPlanQual behavior in this case and act just as we would if the row
had been deleted; no update happens.
This is what the current patch has done.
or (2) throw a serialization
error. You're advocating for #2, but I'm not sure that's right,
because:
1. It's a lot more work,
2. Your proposed implementation needs an on-disk format change that
uses up a scarce infomask bit, and
3. It's not obvious to me that it's clearly preferable from a user
experience standpoint. I mean, either way the user doesn't get the
behavior that they want. Either they're hoping for EPQ semantics and
they instead do a no-op update, or they're hoping for EPQ semantics
and they instead get an ERROR. Generally speaking, we don't throw
serialization errors today at READ COMMITTED, so if we do so here,
that's going to be a noticeable and perhaps unwelcome change.
More opinions welcome.
I am inclined to at least have some option for the user to decide the
behaviour. In the future we can even consider support for walking
through the ctid chain across multiple relfilenodes. But till then, we
need to decide what default behaviour to keep. My inclination is more
towards erroring out in an unfortunate even where there is an UPDATE
while the row-movement is happening. One option is to not get into
finding whether the DELETE was part of partition row-movement or it
was indeed a DELETE, and always error out the UPDATE when
heap_update() returns HeapTupleUpdated, but only if the table is a
leaf partition. But this obviously will cause annoyance because of
chances of getting such errors when there are concurrent updates and
deletes in the same partition. But we can keep a table-level option
for determining whether to error out or silently lose the UPDATE.
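To make the lost-update case concrete, here is a rough two-session
sketch (all table, partition, and column names are made up purely for
illustration; the behavior shown is what the current patch does):

```sql
-- Hypothetical setup, for illustration only:
CREATE TABLE ptab (a date, b int) PARTITION BY RANGE (a);
CREATE TABLE ptab_p1 PARTITION OF ptab
    FOR VALUES FROM ('1900-01-01') TO ('1950-01-01');
CREATE TABLE ptab_p2 PARTITION OF ptab
    FOR VALUES FROM ('1950-01-01') TO ('2000-01-01');
INSERT INTO ptab VALUES ('1924-01-01', 0);

-- Session A: an UPDATE that moves the row from ptab_p1 to ptab_p2
-- (internally a DELETE from ptab_p1 plus an INSERT into ptab_p2):
BEGIN;
UPDATE ptab SET a = '1960-01-01' WHERE b = 0;

-- Session B: blocks on A's row lock; once A commits, B finds the old
-- tuple deleted from ptab_p1 with no ctid link to the new partition,
-- so at READ COMMITTED it silently skips the row:
UPDATE ptab SET b = 1 WHERE b = 0;   -- Session B's update is lost

-- Session A:
COMMIT;
```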
Another option I was thinking : When the UPDATE is on a partition key,
acquire ExclusiveLock (not AccessExclusiveLock) only on that
partition, so that the selects will continue to execute, but
UPDATE/DELETE will wait before opening the table for scan. The UPDATE
on partition key is not going to be a very routine operation, it
sounds more like a DBA maintenance operation; so it does not look like
it would come in between usual transactions.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Feb 16, 2017 at 8:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Generally speaking, we don't throw
serialization errors today at READ COMMITTED, so if we do so here,
that's going to be a noticeable and perhaps unwelcome change.
Yes we do:
https://www.postgresql.org/docs/9.6/static/transaction-iso.html#XACT-REPEATABLE-READ
--
Thomas Munro
http://www.enterprisedb.com
On Mon, Feb 20, 2017 at 3:36 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Thu, Feb 16, 2017 at 8:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Generally speaking, we don't throw
serialization errors today at READ COMMITTED, so if we do so here,
that's going to be a noticeable and perhaps unwelcome change.

Yes we do:
https://www.postgresql.org/docs/9.6/static/transaction-iso.html#XACT-REPEATABLE-READ
Oops -- please ignore, I misread that as repeatable read.
--
Thomas Munro
http://www.enterprisedb.com
Hi Amit,
Thanks for working on this.
On 2017/02/13 21:01, Amit Khandekar wrote:
Currently, an update of a partition key of a partition is not allowed,
since it requires moving the row(s) into the applicable partition.

Attached is a WIP patch (update-partition-key.patch) that removes this
restriction. When an UPDATE causes the row of a partition to violate
its partition constraint, then a partition is searched in that subtree
that can accommodate this row, and if found, the row is deleted from
the old partition and inserted in the new partition. If not found, an
error is reported.
That's clearly an improvement over what we have now.
There are a few things that can be discussed:
1. We can run an UPDATE using a child partition at any level in a
nested partition tree. In such case, we should move the row only
within that child subtree.

For example, in a tree such as:
tab ->
t1 ->
t1_1
t1_2
t2 ->
t2_1
t2_2

For "UPDATE t2 set col1 = 'AAA'", if the modified tuple does not fit
in t2_1 but can fit in t1_1, it should not be moved to t1_1, because
the UPDATE is fired using t2.
Makes sense. One should perform the update by specifying tab such that
the row moves from t2 to t1, before we could determine t1_1 as the target
for the new row. Specifying t2 directly in that case is clearly the
"violates partition constraint" situation. I wonder if that's enough a
hint for the user to try updating the parent (or better still, root
parent). Or as we were discussing, should there be an actual HINT message
spelling that out for the user.
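The subtree restriction could be exercised with something like the
following sketch (partition bounds and the second column are invented
here just to mirror the tree from the example; only the names tab, t1,
t1_1, t1_2, t2, t2_1, t2_2, and col1 come from the thread):

```sql
-- Two-level tree matching the example above, with made-up bounds:
CREATE TABLE tab (col1 text, col2 int) PARTITION BY LIST (col1);

CREATE TABLE t1 PARTITION OF tab FOR VALUES IN ('AAA', 'BBB')
    PARTITION BY LIST (col1);
CREATE TABLE t1_1 PARTITION OF t1 FOR VALUES IN ('AAA');
CREATE TABLE t1_2 PARTITION OF t1 FOR VALUES IN ('BBB');

CREATE TABLE t2 PARTITION OF tab FOR VALUES IN ('CCC', 'DDD')
    PARTITION BY LIST (col1);
CREATE TABLE t2_1 PARTITION OF t2 FOR VALUES IN ('CCC');
CREATE TABLE t2_2 PARTITION OF t2 FOR VALUES IN ('DDD');

INSERT INTO tab VALUES ('CCC', 1);

-- Fired on t2, routing is confined to t2's subtree; 'AAA' fits no
-- partition under t2, so this errors instead of moving the row to t1_1:
UPDATE t2 SET col1 = 'AAA';

-- Fired on the root, the same change can route the row into t1_1:
UPDATE tab SET col1 = 'AAA';
```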
2. In the patch, as part of the row movement, ExecDelete() is called
followed by ExecInsert(). This is done that way, because we want to
have the ROW triggers on that (sub)partition executed. If a user has
explicitly created DELETE and INSERT BR triggers for this partition, I
think we should run those. While at the same time, another question
is, what about UPDATE trigger on the same table ? Here again, one can
argue that because this UPDATE has been transformed into a
DELETE-INSERT, we should not run UPDATE trigger for row-movement. But
there can be a counter-argument. For e.g. if a user needs to make sure
about logging updates of particular columns of a row, he will expect
the logging to happen even when that row was transparently moved. In
the patch, I have retained the firing of UPDATE BR trigger.
What of UPDATE AR triggers?
As a comment on how row-movement is being handled in code, I wonder if it
could be made to look similar structurally to the code in ExecInsert()
that handles ON CONFLICT DO UPDATE. That is,
if (partition constraint fails)
{
/* row movement */
}
else
{
/* ExecConstraints() */
/* heap_update(), EvalPlanQual(), and ExecInsertIndexTuples() */
}
I see that ExecConstraint() won't get called on the source partition's
constraints if row movement occurs. Maybe, that's unnecessary because the
new row won't be inserted into that partition anyway.
ExecWithCheckOptions() for RLS update check does happen *before* row
movement though.
3. In case of a concurrent update/delete, suppose session A has locked
the row for deleting it. Now a session B has decided to update this
row and that is going to cause row movement, which means it will
delete it first. But when session A is finished deleting it, session B
finds that it is already deleted. In such case, it should not go ahead
with inserting a new row as part of the row movement. For that, I have
added a new parameter 'already_delete' for ExecDelete().
Makes sense. Maybe: already_deleted -> concurrently_deleted.
Of course, this still won't completely solve the concurrency anomaly.
In the above case, the UPDATE of Session B gets lost. May be, for a
user that does not tolerate this, we can have a table-level option
that disallows row movement, or will cause an error to be thrown for
one of the concurrent sessions.
Will this table-level option be specified for a partitioned table once or
for individual partitions?
4. The ExecSetupPartitionTupleRouting() is re-used for routing the row
that is to be moved. So in ExecInitModifyTable(), we call
ExecSetupPartitionTupleRouting() even for UPDATE. We can also do this
only during execution time for the very first time we find that we
need to do a row movement. I will think over that, but I am thinking
it might complicate things, as compared to always doing the setup for
UPDATE. Will check on that.
Hmm. ExecSetupPartitionTupleRouting(), which does a significant amount of
setup work, is fine being called in ExecInitModifyTable() in the insert
case because there are often cases where that's a bulk-insert and hence
cost of the setup work is amortized. Updates, OTOH, are seldom done in a
bulk manner. So that might be an argument for doing it late only when
needed. But that starts to sound less attractive when one realizes that
that will occur for every row that wants to move.
I wonder if updates that will require row movement when done will be done
in a bulk manner (as a maintenance op), so one-time tuple routing setup
seems fine. Again, enable_row_movement option specified for the parent
sounds like it would be a nice to have. Only do the setup if it's turned
on, which goes without saying.
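If such an option were added, it might look like a reloption on the
parent. The following is purely hypothetical syntax to illustrate the
idea; no enable_row_movement option exists in the patch:

```sql
-- Hypothetical: disallow cross-partition row movement on this table.
ALTER TABLE tab SET (enable_row_movement = off);

-- With it off, an UPDATE whose new values violate the source
-- partition's constraint would raise an error instead of moving the
-- row, and the tuple-routing setup could be skipped for UPDATEs.
```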
5. Regarding performance testing, I have compared the results of
row-movement with partition versus row-movement with inheritance tree
using triggers. Below are the details:

Schema:
[ ... ]
parts   partitioned    inheritance     no. of rows   subpartitions
=====   ===========    ===========     ===========   =============
500     10 sec         3 min 02 sec    1,000,000     0
1000    10 sec         3 min 05 sec    1,000,000     0
1000    1 min 38 sec   30 min 50 sec   10,000,000    0
4000    28 sec         5 min 41 sec    1,000,000     10

parts : total number of partitions, including subpartitions if any.
partitioned : partitions created using declarative syntax.
inheritance : partitions created using inheritance, check constraints,
and insert/update triggers.
subpartitions : number of subpartitions for each partition (in a 2-level tree).

Overall, the UPDATE in partitions is faster by 10-20 times compared
with inheritance with triggers.

The UPDATE query moved all of the rows into another partition. It was
something like this:

update ptab set a = '1949-01-1' where a <= '1924-01-01'

For a plain table with 1,000,000 rows, the UPDATE took 8 seconds, and
with 10,000,000 rows, it took 1 min 32 sec.
Nice!
In general, for both partitioned and inheritance tables, the time
taken linearly rose with the number of rows.
Hopefully not also with the number of partitions though.
I will look more closely at the code soon.
Thanks,
Amit
On Mon, Feb 20, 2017 at 2:58 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
I am inclined to at least have some option for the user to decide the
behaviour. In the future we can even consider support for walking
through the ctid chain across multiple relfilenodes. But till then, we
need to decide what default behaviour to keep. My inclination is more
towards erroring out in the unfortunate event where there is an UPDATE
while the row-movement is happening. One option is to not get into
finding whether the DELETE was part of partition row-movement or it
was indeed a DELETE, and always error out the UPDATE when
heap_update() returns HeapTupleUpdated, but only if the table is a
leaf partition. But this obviously will cause annoyance because of
chances of getting such errors when there are concurrent updates and
deletes in the same partition. But we can keep a table-level option
for determining whether to error out or silently lose the UPDATE.
I'm still a fan of the "do nothing and just document that this is a
weirdness of partitioned tables" approach, because implementing
something will be complicated, will ensure that this misses this
release if not the next one, and may not be any better for users. But
probably we need to get some more opinions from other people, since I
can imagine people being pretty unhappy if the consensus happens to be
at odds with my own preferences.
Another option I was thinking : When the UPDATE is on a partition key,
acquire ExclusiveLock (not AccessExclusiveLock) only on that
partition, so that the selects will continue to execute, but
UPDATE/DELETE will wait before opening the table for scan. The UPDATE
on partition key is not going to be a very routine operation, it
sounds more like a DBA maintenance operation; so it does not look like
it would come in between usual transactions.
I think that's going to make users far more unhappy than breaking the
EPQ behavior ever would.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Friday, February 24, 2017, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Feb 20, 2017 at 2:58 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

I am inclined to at least have some option for the user to decide the
behaviour. In the future we can even consider support for walking
through the ctid chain across multiple relfilenodes. But till then, we
need to decide what default behaviour to keep. My inclination is more
towards erroring out in the unfortunate event where there is an UPDATE
while the row-movement is happening. One option is to not get into
finding whether the DELETE was part of partition row-movement or it
was indeed a DELETE, and always error out the UPDATE when
heap_update() returns HeapTupleUpdated, but only if the table is a
leaf partition. But this obviously will cause annoyance because of
chances of getting such errors when there are concurrent updates and
deletes in the same partition. But we can keep a table-level option
for determining whether to error out or silently lose the UPDATE.

I'm still a fan of the "do nothing and just document that this is a
weirdness of partitioned tables" approach, because implementing
something will be complicated, will ensure that this misses this
release if not the next one, and may not be any better for users. But
probably we need to get some more opinions from other people, since I
can imagine people being pretty unhappy if the consensus happens to be
at odds with my own preferences.
For my own sanity - the move update would complete successfully and break
every ctid chain that it touches. Any update lined up behind it in the
lock queue would discover their target record has been deleted and
would experience whatever behavior their isolation level dictates for such
a situation. So multi-partition update queries will fail to update some
records if they happen to move between partitions even if they would
otherwise match the query's predicate.
Is there any difference in behavior between this and a SQL writeable CTE
performing the same thing via delete-returning-insert?
David J.
On Fri, Feb 24, 2017 at 1:18 PM, David G. Johnston
<david.g.johnston@gmail.com> wrote:
For my own sanity - the move update would complete successfully and break
every ctid chain that it touches. Any update lined up behind it in the lock
queue would discover their target record has been deleted and would
experience whatever behavior their isolation level dictates for such a
situation. So multi-partition update queries will fail to update some
records if they happen to move between partitions even if they would
otherwise match the query's predicate.
Right. That's the behavior for which I am advocating, on the grounds
that it's the simplest to implement and if we all agree on something
else more complicated later, we can do it then.
Is there any difference in behavior between this and a SQL writeable CTE
performing the same thing via delete-returning-insert?
Not to my knowledge.
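For reference, a writable-CTE row move of the kind David describes
would look roughly like this (partition and column names are
hypothetical, borrowed from the t1/t2 example earlier in the thread):

```sql
-- Manually moving a row between partitions with a data-modifying CTE.
-- Like the patch's internal DELETE + INSERT, this breaks the ctid
-- chain: a concurrent UPDATE waiting on the row sees it as deleted.
WITH moved AS (
    DELETE FROM t2_1
    WHERE col1 = 'CCC'
    RETURNING col2
)
INSERT INTO t1_1
SELECT 'AAA', col2 FROM moved;
```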
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company