Delay locking partitions during INSERT and UPDATE
As a follow-on from [1] and also discussed in [2], I'd like to propose
that we don't obtain locks on all partitions during INSERT into a
partitioned table and UPDATE of a partitioned key and instead, only
lock the partition when we first route a tuple to it. This means that
the order that the locks are obtained is no longer well defined and is
at the mercy of the order that tuples are INSERTed or UPDATEd. It
seems worth relaxing this a bit for gains in performance, as when a
partitioned table contains many partitions, the overhead of locking
all partitions when inserting a single row, or just a few rows is
often significantly higher than the cost of doing the actual insert.
The current behaviour was added in 54cde0c4c058073 in order to
minimise deadlock risk. It seems that the risk there only comes from
AELs that could be taken when a partition directly receives a TRUNCATE
/ CREATE INDEX / VACUUM FULL / CLUSTER. There's obviously no conflict
with other DML operations since two RowExclusiveLocks don't conflict
with each other. I think all other AEL obtaining DDL must be
performed on the top level partitioned table, for example, ADD COLUMN
can't be done directly on a partition, so there's no added deadlock
risk from those. For a deadlock to occur one of the above DDL commands
would have to be executed inside a transaction in an order opposite to
the order rows are being INSERTed or UPDATEd in the partitioned table.
If required, such operations could LOCK TABLE the top partitioned
table to block the DML operation. There's already a risk of similar
deadlocks from such operations done on multiple separate tables when
the order they're done is not the same as the order the tables are
written in a query, although, in that case, the window for the
deadlock is likely to be much smaller.
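To make the ordering hazard concrete, here is a hypothetical interleaving; the table and partition names (parted, parted_p1, parted_p2) are invented for illustration and are not from the patch:

```sql
-- With delayed locking, session 1 locks partitions in tuple-routing order:
--
--   Session 1                          Session 2
--   BEGIN;                             BEGIN;
--   INSERT INTO parted VALUES (2);     TRUNCATE parted_p1;  -- AccessExclusiveLock on p1
--     -- RowExclusiveLock on p2
--   INSERT INTO parted VALUES (1);     TRUNCATE parted_p2;
--     -- waits for p1 ...                -- ... waits for p2: deadlock
--
-- The workaround mentioned above: lock the parent first, which blocks
-- concurrent DML on the partition tree before touching any partition.
BEGIN;
LOCK TABLE parted IN ACCESS EXCLUSIVE MODE;
TRUNCATE parted_p1;
TRUNCATE parted_p2;
COMMIT;
```

The same pattern applies to CREATE INDEX, VACUUM FULL, and CLUSTER run directly on individual partitions inside a transaction.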
With this done, the performance of an INSERT into a 10k partition
partitioned table looks like:
Setup:
create table hashp (a int) partition by hash(a);
select 'create table hashp'||x::Text || ' partition of hashp for
values with (modulus 10000, remainder '||x::text||');' from
generate_Series(0,9999) x;
\gexec
hashp_insert.sql:
\set p_a random(1,1000)
insert into hashp values(:p_a);
Results:
$ psql -c "truncate hashp;" postgres && pgbench -n -f hashp_insert.sql
-M prepared -c 4 -j 4 -T 60 postgres
Patched:
tps = 27811.427620 (excluding connections establishing)
tps = 28617.417308 (excluding connections establishing)
Unpatched:
tps = 130.446706 (excluding connections establishing)
tps = 119.726641 (excluding connections establishing)
The patch is attached.
I'll park this here until the January commitfest.
[1]: /messages/by-id/CAKJS1f_1RJyFquuCKRFHTdcXqoPX-PYqAd7nz=GVBwvGh4a6xA@mail.gmail.com
[2]: /messages/by-id/25C1C6B2E7BE044889E4FE8643A58BA963B5796B@G01JPEXMBKW03
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
v1-0001-Delay-locking-of-partitions-during-INSERT-and-UPD.patch
Hi,
On 11/23/18 1:14 AM, David Rowley wrote:
> As a follow-on from [1] and also discussed in [2], I'd like to propose
> that we don't obtain locks on all partitions during INSERT into a
> partitioned table and UPDATE of a partitioned key and instead, only
> lock the partition when we first route a tuple to it. This means that
> the order that the locks are obtained is no longer well defined and is
> at the mercy of the order that tuples are INSERTed or UPDATEd. It
> seems worth relaxing this a bit for gains in performance, as when a
> partitioned table contains many partitions, the overhead of locking
> all partitions when inserting a single row, or just a few rows is
> often significantly higher than the cost of doing the actual insert.
Yep, the locking seems like a significant bottleneck. I've done quite a
bit of testing on two machines, using a slightly modified version of
your test script with variable number of partitions (0 means not
partitioned), and the results look like this:
1) xeon e5-2620v4
partitions 0 100 1000 10000
---------------------------------------------
master 16643 6956 1039 108
patched 16398 15522 15222 13228
2) i5-2500k
partitions 0 100 1000 10000
-----------------------------------------
master 3901 2892 920 76
patched 3894 3838 3845 3522
When using UNLOGGED tables to minimize the external noise, it looks like
this:
3) xeon e5-2620v4
partitions 0 100 1000 10000
--------------------------------------------
master 30806 8740 1091 107
patched 30455 28137 27582 24985
partitions 0 100 1000 10000
--------------------------------------------
master 27662 9013 1277 79
patched 28263 26474 25794 22434
So the performance benefit is pretty clear - up to 2 orders of magnitude
with 10k partitions, and gets us fairly close to non-partitioned table.
I like it.
> The current behaviour was added in 54cde0c4c058073 in order to
> minimise deadlock risk. It seems that the risk there only comes from
> AELs that could be taken when a partition directly receives a TRUNCATE
> / CREATE INDEX / VACUUM FULL / CLUSTER. There's obviously no conflict
> with other DML operations since two RowExclusiveLocks don't conflict
> with each other. I think all other AEL obtaining DDL must be
> performed on the top level partitioned table, for example, ADD COLUMN
> can't be done directly on a partition, so there's no added deadlock
> risk from those. For a deadlock to occur one of the above DDL commands
> would have to be executed inside a transaction in an order opposite to
> the order rows are being INSERTed or UPDATEd in the partitioned table.
> If required, such operations could LOCK TABLE the top partitioned
> table to block the DML operation. There's already a risk of similar
> deadlocks from such operations done on multiple separate tables when
> the order they're done is not the same as the order the tables are
> written in a query, although, in that case, the window for the
> deadlock is likely to be much smaller.
Hmmm, yeah.
Per the discussion in [1] the locking was necessary also to ensure
partitions can't disappear while we're building the descriptors in
RelationBuildPartitionDesc(). But AFAICS 3f2393edef fixed this.
The other issue - as you note - is ensuring locking order, to prevent
(or rather reduce the risk of) deadlocks. I agree with your assessment
here, i.e. that locking the parent is a sufficient protection.
Maybe there's an alternative solution with the same benefits and not
sacrificing the lock ordering, but I fail to see how it would work.
[1] /messages/by-id/CAKJS1f_1RJyFquuCKRFHTdcXqoPX-PYqAd7nz=GVBwvGh4a6xA@mail.gmail.com
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature: tested, failed
Spec compliant: tested, failed
Documentation: tested, failed
Hi,
Increasing the number of clients, I benchmarked with a table partitioned into 1k partitions.
I confirmed that this patch improves performance by 10 times or more.
master (commit: 90525d7b4e Date: Tue Jan 15 12:19:21 2019 -0800)
cpu:
Xeon(R) CPU E5-2667 v3 * 2
setup:
create table history(aid int, delta int, mtime timestamp without time zone) partition by range(aid);
\o /dev/null
select 'create table history_' || x || ' partition of history for values from(' || x ||') to(' || x+1 || ')' from generate_series(1, 1000) x;
\gexec
benchmark.sql:
\set aid random(1, 1000)
\set delta random(-5000, 5000)
INSERT INTO history VALUES (:aid, :delta, CURRENT_TIMESTAMP);
command line:
pgbench -d testdb -f benchmark.sql -c number_of_clients -T 60 -r -n
Results:
clients | tps_patched | tps_unpatched | tps_unpatched / tps_patched
---------+-------------+---------------+-----------------------------
1 | 8890 | 841 | 11
2 | 17484 | 1470 | 12
4 | 29218 | 2474 | 12
8 | 48789 | 3876 | 13
16 | 68794 | 4030 | 17
32 | 69550 | 2888 | 24
64 | 71196 | 2555 | 28
128 | 71739 | 2295 | 31
256 | 66548 | 2105 | 32
I wonder why performance does not increase much once the number of clients exceeds 16.
Even if that is caused by lightweight lock contention, 16 clients seems low for it to appear.
Also, I ran make installcheck-world, but the partition_prune test failed.
However, this test case failed even before applying the patch, so the failure is not caused by this patch.
One of the results is as follows.
explain (analyze, costs off, summary off, timing off) execute ab_q1 (2, 2, 3);
- QUERY PLAN
----------------------------------------------------------
+ QUERY PLAN
+------------------------------------------------------
Append (actual rows=0 loops=1)
- Subplans Removed: 6
-> Seq Scan on ab_a2_b1 (actual rows=0 loops=1)
- Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
+ Filter: ((a >= 2) AND (a <= 2) AND (b <= 3))
-> Seq Scan on ab_a2_b2 (actual rows=0 loops=1)
- Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
+ Filter: ((a >= 2) AND (a <= 2) AND (b <= 3))
-> Seq Scan on ab_a2_b3 (actual rows=0 loops=1)
- Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
-(8 rows)
+ Filter: ((a >= 2) AND (a <= 2) AND (b <= 3))
+(7 rows)
regards,
sho kato
On Fri, 18 Jan 2019 at 19:08, sho kato <kato-sho@jp.fujitsu.com> wrote:
> I confirmed that this patch improves performance by 10 times or more.
Thanks for testing this out.
> Also, I ran make installcheck-world, but the partition_prune test failed.
> However, this test case failed even before applying the patch, so the failure is not caused by this patch.
> One of the results is as follows.
>
> explain (analyze, costs off, summary off, timing off) execute ab_q1 (2, 2, 3);
> -                        QUERY PLAN
> ----------------------------------------------------------
> +                      QUERY PLAN
> +------------------------------------------------------
>  Append (actual rows=0 loops=1)
> -   Subplans Removed: 6
>    ->  Seq Scan on ab_a2_b1 (actual rows=0 loops=1)
> -         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
> +         Filter: ((a >= 2) AND (a <= 2) AND (b <= 3))
>    ->  Seq Scan on ab_a2_b2 (actual rows=0 loops=1)
> -         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
> +         Filter: ((a >= 2) AND (a <= 2) AND (b <= 3))
>    ->  Seq Scan on ab_a2_b3 (actual rows=0 loops=1)
> -         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
> -(8 rows)
> +         Filter: ((a >= 2) AND (a <= 2) AND (b <= 3))
> +(7 rows)
Perhaps you're running with plan_cache_mode = force_custom_plan.
You'll likely need it set to auto for these to pass.
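For reference, a quick way to check and adjust the setting in a psql session (plan_cache_mode is the GUC involved here; a session-level SET is shown, though an installcheck run would need the default in postgresql.conf):

```sql
SHOW plan_cache_mode;        -- force_custom_plan would explain the diff above
SET plan_cache_mode = auto;  -- the partition_prune expected output assumes the default
```

With auto, the fifth execution of the prepared statement can switch to a generic plan, which is what the "$1"-style filters in the expected output come from.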
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 11/22/18, David Rowley <david.rowley@2ndquadrant.com> wrote:
> If required, such operations could LOCK TABLE the top partitioned
> table to block the DML operation. There's already a risk of similar
> deadlocks from such operations done on multiple separate tables when
> the order they're done is not the same as the order the tables are
> written in a query, although, in that case, the window for the
> deadlock is likely to be much smaller.
Is this something that would need documentation anywhere?
> With this done, the performance of an INSERT into a 10k partition
> partitioned table looks like:
>
> Setup:
> create table hashp (a int) partition by hash(a);
> select 'create table hashp'||x::Text || ' partition of hashp for
> values with (modulus 10000, remainder '||x::text||');' from
> generate_Series(0,9999) x;
> \gexec
>
> hashp_insert.sql:
> \set p_a random(1,1000)
> insert into hashp values(:p_a);
>
> Results:
> $ psql -c "truncate hashp;" postgres && pgbench -n -f hashp_insert.sql
> -M prepared -c 4 -j 4 -T 60 postgres
I used a similar test, but with unlogged tables, and "-c 2", and got:
normal table: 32000tps
10k partitions / master: 82tps
10k partitions / patch: 7000tps
So far I haven't gotten quite as good performance as you and Tomas,
although it's still a ~85x improvement.
-John Naylor
On 1/19/19 12:05 AM, John Naylor wrote:
> On 11/22/18, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> If required, such operations could LOCK TABLE the top partitioned
>> table to block the DML operation. There's already a risk of similar
>> deadlocks from such operations done on multiple separate tables when
>> the order they're done is not the same as the order the tables are
>> written in a query, although, in that case, the window for the
>> deadlock is likely to be much smaller.
>
> Is this something that would need documentation anywhere?
Not sure. Initially I was going to say "no" because it's an internal
implementation detail and the risk of the deadlock is already there
anyway. But maybe this patch is making it more likely and we should at
least mention how partitions are locked.
>> With this done, the performance of an INSERT into a 10k partition
>> partitioned table looks like:
>>
>> Setup:
>> create table hashp (a int) partition by hash(a);
>> select 'create table hashp'||x::Text || ' partition of hashp for
>> values with (modulus 10000, remainder '||x::text||');' from
>> generate_Series(0,9999) x;
>> \gexec
>>
>> hashp_insert.sql:
>> \set p_a random(1,1000)
>> insert into hashp values(:p_a);
>>
>> Results:
>> $ psql -c "truncate hashp;" postgres && pgbench -n -f hashp_insert.sql
>> -M prepared -c 4 -j 4 -T 60 postgres
>
> I used a similar test, but with unlogged tables, and "-c 2", and got:
>
> normal table: 32000tps
> 10k partitions / master: 82tps
> 10k partitions / patch: 7000tps
>
> So far I haven't gotten quite as good performance as you and Tomas,
> although it's still a ~85x improvement.
What hardware are you running the tests on? I wouldn't be surprised if
you were hitting some CPU or I/O bottleneck, which we're not hitting.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Jan 19, 2019 at 10:59 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 1/19/19 12:05 AM, John Naylor wrote:
>> I used a similar test, but with unlogged tables, and "-c 2", and got:
>>
>> normal table: 32000tps
>> 10k partitions / master: 82tps
>> 10k partitions / patch: 7000tps
>>
>> So far I haven't gotten quite as good performance as you and Tomas,
>> although it's still a ~85x improvement.
>
> What hardware are you running the tests on? I wouldn't be surprised if
> you were hitting some CPU or I/O bottleneck, which we're not hitting.
9 year-old laptop, Core i3. Side note, I miswrote my test parameters
above -- should be "-c4 -j2".
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 1/20/19 5:45 AM, John Naylor wrote:
> On Sat, Jan 19, 2019 at 10:59 AM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> On 1/19/19 12:05 AM, John Naylor wrote:
>>> I used a similar test, but with unlogged tables, and "-c 2", and got:
>>>
>>> normal table: 32000tps
>>> 10k partitions / master: 82tps
>>> 10k partitions / patch: 7000tps
>>>
>>> So far I haven't gotten quite as good performance as you and Tomas,
>>> although it's still a ~85x improvement.
>>
>> What hardware are you running the tests on? I wouldn't be surprised if
>> you were hitting some CPU or I/O bottleneck, which we're not hitting.
>
> 9 year-old laptop, Core i3. Side note, I miswrote my test parameters
> above -- should be "-c4 -j2".
Hmmm, I wouldn't be surprised if it was getting throttled for some
reason (e.g. temperature).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, 19 Jan 2019 at 12:05, John Naylor <jcnaylor@gmail.com> wrote:
> On 11/22/18, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> If required, such operations could LOCK TABLE the top partitioned
>> table to block the DML operation. There's already a risk of similar
>> deadlocks from such operations done on multiple separate tables when
>> the order they're done is not the same as the order the tables are
>> written in a query, although, in that case, the window for the
>> deadlock is likely to be much smaller.
>
> Is this something that would need documentation anywhere?
Probably at least the release notes. I'm unsure where else to mention
it. I don't feel the workaround of using LOCK TABLE is special to
this case. The patch does, however, make it a possible requirement for
performing DDL on individual partitions where it was not previously.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, 18 Jan 2019 at 19:41, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Perhaps you're running with plan_cache_mode = force_custom_plan.
> You'll likely need it set to auto for these to pass.
Thank you.
I was running with plan_cache_mode = force_custom_plan.
When it is set to auto, all tests pass.
regards,
sho kato
The cfbot reports this patch no longer applies.
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, 23 Jan 2019 at 04:35, John Naylor <john.naylor@2ndquadrant.com> wrote:
> The cfbot reports this patch no longer applies.
Thanks. I've attached a rebased version.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
v2-0001-Delay-locking-of-partitions-during-INSERT-and-UPD.patch
On Tue, Jan 22, 2019 at 4:31 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
> Thanks. I've attached a rebased version.
Hmm, now instead of an 85x speedup over master in the 10k partition
case, I only get 20x. Anyone else see this?
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, 24 Jan 2019 at 13:38, John Naylor <john.naylor@2ndquadrant.com> wrote:
> On Tue, Jan 22, 2019 at 4:31 PM David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> Thanks. I've attached a rebased version.
>
> Hmm, now instead of an 85x speedup over master in the 10k partition
> case, I only get 20x. Anyone else see this?
What's it like with fsync off?
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jan 23, 2019 at 7:56 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On Thu, 24 Jan 2019 at 13:38, John Naylor <john.naylor@2ndquadrant.com> wrote:
>> Hmm, now instead of an 85x speedup over master in the 10k partition
>> case, I only get 20x. Anyone else see this?
>
> What's it like with fsync off?
No change. Just in case, I tried git bisect and also recreated the
cluster, but never got the same performance as my first test, so not
sure what happened.
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 1/24/19 9:50 PM, John Naylor wrote:
> On Wed, Jan 23, 2019 at 7:56 PM David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> On Thu, 24 Jan 2019 at 13:38, John Naylor <john.naylor@2ndquadrant.com> wrote:
>>> Hmm, now instead of an 85x speedup over master in the 10k partition
>>> case, I only get 20x. Anyone else see this?
>>
>> What's it like with fsync off?
>
> No change. Just in case, I tried git bisect and also recreated the
> cluster, but never got the same performance as my first test, so not
> sure what happened.
I can still see about the same performance as before (on both clusters).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jan 24, 2019 at 4:17 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I can still see about the same performance as before (on both clusters).
Thanks for checking! I think the thing to do now is have a committer
look at it. It's a small patch with obvious performance effects --
there's just the question of whether the locking behavior with
concurrent DDL is as safe as it is now.
Anyone have anything else?
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 1/24/19 10:34 PM, John Naylor wrote:
> On Thu, Jan 24, 2019 at 4:17 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> I can still see about the same performance as before (on both clusters).
>
> Thanks for checking! I think the thing to do now is have a committer
> look at it. It's a small patch with obvious performance effects --
> there's just the question of whether the locking behavior with
> concurrent DDL is as safe as it is now.
>
> Anyone have anything else?
Yes, I don't see why the patch would change that, and I've been looking
for such cases. I think David was looking at that this week too, and I
assume he'll send an update if he finds anything. If not, I plan to get
it committed soon-ish (possibly next week after the FOSDEM dev meeting,
unless there are some objections).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jan 24, 2019 at 4:43 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Yes, I don't see why the patch would change that, and I've been looking
> for such cases. I think David was looking at that this week too, and I
> assume he'll send an update if he finds anything. If not, I plan to get
> it committed soon-ish (possibly next week after the FOSDEM dev meeting,
> unless there are some objections).
I have reviewed this patch and I am in favor of it. I think it likely
needs minor rebasing because of the heap_open -> table_open renaming.
I also agree that it's worth taking some deadlock risk for the rather
massive performance gain, although I suspect it's likely that a few
users are going to complain about deadlocks and I wonder whether we'll
have to put some energy into that problem at some point. However, I
think what we probably want to do there is reduce the probability of
deadlocks through some trickery or maybe invent some new locking
mechanisms that work around the problem. The alternative of trying to
block this patch seems unpalatable.
In short, +1 from me.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, 1 Feb 2019 at 07:46, Robert Haas <robertmhaas@gmail.com> wrote:
> I have reviewed this patch and I am in favor of it. I think it likely
> needs minor rebasing because of the heap_open -> table_open renaming.
Many thanks for looking at it. The v2 patch was based on top of the
heap_open -> table_open change.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services