Performance improvements of INSERTs to a partitioned table

Started by Kato, Shoabout 7 years ago6 messages
#1Kato, Sho
Kato, Sho
kato-sho@jp.fujitsu.com

Hi,

I want to discuss performance improvements of INSERTs to a partitioned table.

When an application inserts records into a table partitioned into thousands, find_all_inheritors() locks all of the partitions.
Updating partition key locks in the same way.

So, Execution time becomes longer as the number of partition increases.

* nparts 8

testdb=# explain analyze insert into test.accounts_history(aid, delta, mtime) values(8, 5000, current_timestamp);
QUERY PLAN
---------------------------------------------------------------------------------------------------------
Insert on accounts_history (cost=0.00..0.02 rows=1 width=20) (actual time=0.281..0.281 rows=0 loops=1)
-> Result (cost=0.00..0.02 rows=1 width=20) (actual time=0.079..0.080 rows=1 loops=1)
Planning Time: 0.080 ms
Execution Time: 0.362 ms
(4 rows)

* nparts 8192

testdb=# explain analyze insert into test.accounts_history(aid, delta, mtime) values(8192, 5000, current_timestamp);
QUERY PLAN
---------------------------------------------------------------------------------------------------------
Insert on accounts_history (cost=0.00..0.02 rows=1 width=20) (actual time=0.058..0.059 rows=0 loops=1)
-> Result (cost=0.00..0.02 rows=1 width=20) (actual time=0.032..0.034 rows=1 loops=1)
Planning Time: 0.032 ms
Execution Time: 12.508 ms
(4 rows)

Locking only the target partitions like the patch previously proposed by David[1]/messages/by-id/CAKJS1f_1RJyFquuCKRFHTdcXqoPX-PYqAd7nz=GVBwvGh4a6xA@mail.gmail.com, the performance will improve greatly.

* nparts 8192 (patched)

testdb=# explain analyze insert into test.accounts_history(aid, delta, mtime) values(8192, 5000, current_timestamp);
QUERY PLAN
---------------------------------------------------------------------------------------------------------
Insert on accounts_history (cost=0.00..0.02 rows=1 width=20) (actual time=0.415..0.416 rows=0 loops=1)
-> Result (cost=0.00..0.02 rows=1 width=20) (actual time=0.140..0.141 rows=1 loops=1)
Planning Time: 0.120 ms
Execution Time: 1.694 ms
(4 rows)

However, I am concerned that "unsafe" is included in the name of this patch.
If locking only target partitions and locking all of partitions are executed at the same time, a problem may occurs.
But, I'm not sure what kind of problem will occur.
Is it enough to lock only the target partitions?

If a problem occurs in above case, I think it is safer to divide the steps to acquire the lock into two.

In first step, locking only the parent table in share or exclusive mode.
In second step, locking only the target partitions after locking the parent table.

Thoughts?

[1]: /messages/by-id/CAKJS1f_1RJyFquuCKRFHTdcXqoPX-PYqAd7nz=GVBwvGh4a6xA@mail.gmail.com

regards,
Sho Kato

#2Kato, Sho
Kato, Sho
kato-sho@jp.fujitsu.com
In reply to: Kato, Sho (#1)
RE: Performance improvements of INSERTs to a partitioned table

AFAIK, When CREATE INDEX on a partition and INSERT to a parent table are executed at the same time, this patch causes deadlock.

* partitions information

Partition key: RANGE (a)
Partitions: a_1 FOR VALUES FROM (1) TO (100),
a_2 FOR VALUES FROM (100) TO (200)

T1: create index a_1_ix on a_1(a);
T2: insert into a values(101),(1); locking a_2 and waiting releasing a_1’s lock
T1: create index a_2_ix on a_2(a); ERROR: deadlock detected |

I think this situation does not mean unsafe because similar situation will occurs at no partitioned tables and DBMS could not prevent this situation.
But, I'm not sure this behavior is correct.
Does similar deadlock occur in other DBMS like Oracle?

From: Kato, Sho [mailto:kato-sho@jp.fujitsu.com]
Sent: Tuesday, November 6, 2018 6:36 PM
To: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Subject: Performance improvements of INSERTs to a partitioned table

Hi,

I want to discuss performance improvements of INSERTs to a partitioned table.

When an application inserts records into a table partitioned into thousands, find_all_inheritors() locks all of the partitions.
Updating partition key locks in the same way.

So, Execution time becomes longer as the number of partition increases.

* nparts 8

testdb=# explain analyze insert into test.accounts_history(aid, delta, mtime) values(8, 5000, current_timestamp);
QUERY PLAN
---------------------------------------------------------------------------------------------------------
Insert on accounts_history (cost=0.00..0.02 rows=1 width=20) (actual time=0.281..0.281 rows=0 loops=1)
-> Result (cost=0.00..0.02 rows=1 width=20) (actual time=0.079..0.080 rows=1 loops=1)
Planning Time: 0.080 ms
Execution Time: 0.362 ms
(4 rows)

* nparts 8192

testdb=# explain analyze insert into test.accounts_history(aid, delta, mtime) values(8192, 5000, current_timestamp);
QUERY PLAN
---------------------------------------------------------------------------------------------------------
Insert on accounts_history (cost=0.00..0.02 rows=1 width=20) (actual time=0.058..0.059 rows=0 loops=1)
-> Result (cost=0.00..0.02 rows=1 width=20) (actual time=0.032..0.034 rows=1 loops=1)
Planning Time: 0.032 ms
Execution Time: 12.508 ms
(4 rows)

Locking only the target partitions like the patch previously proposed by David[1]/messages/by-id/CAKJS1f_1RJyFquuCKRFHTdcXqoPX-PYqAd7nz=GVBwvGh4a6xA@mail.gmail.com, the performance will improve greatly.

* nparts 8192 (patched)

testdb=# explain analyze insert into test.accounts_history(aid, delta, mtime) values(8192, 5000, current_timestamp);
QUERY PLAN
---------------------------------------------------------------------------------------------------------
Insert on accounts_history (cost=0.00..0.02 rows=1 width=20) (actual time=0.415..0.416 rows=0 loops=1)
-> Result (cost=0.00..0.02 rows=1 width=20) (actual time=0.140..0.141 rows=1 loops=1)
Planning Time: 0.120 ms
Execution Time: 1.694 ms
(4 rows)

However, I am concerned that "unsafe" is included in the name of this patch.
If locking only target partitions and locking all of partitions are executed at the same time, a problem may occurs.
But, I'm not sure what kind of problem will occur.
Is it enough to lock only the target partitions?

If a problem occurs in above case, I think it is safer to divide the steps to acquire the lock into two.

In first step, locking only the parent table in share or exclusive mode.
In second step, locking only the target partitions after locking the parent table.

Thoughts?

[1]: /messages/by-id/CAKJS1f_1RJyFquuCKRFHTdcXqoPX-PYqAd7nz=GVBwvGh4a6xA@mail.gmail.com

regards,
Sho Kato

#3David Rowley
David Rowley
david.rowley@2ndquadrant.com
In reply to: Kato, Sho (#2)
Re: Performance improvements of INSERTs to a partitioned table

On 7 November 2018 at 21:31, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:

AFAIK, When CREATE INDEX on a partition and INSERT to a parent table are
executed at the same time, this patch causes deadlock.

* partitions information

Partition key: RANGE (a)
Partitions: a_1 FOR VALUES FROM (1) TO (100),
a_2 FOR VALUES FROM (100) TO (200)

T1: create index a_1_ix on a_1(a);
T2: insert into a values(101),(1); locking a_2 and waiting releasing a_1’s
lock
T1: create index a_2_ix on a_2(a); ERROR: deadlock detected

I think this situation does not mean unsafe because similar situation will
occurs at no partitioned tables and DBMS could not prevent this situation.

But, I'm not sure this behavior is correct.

That's one case that will deadlock with the 0002 patch in [1]/messages/by-id/CAKJS1f-vYAHqqaU878Q4cdZYHwPcv1J_C-mG=Cs2UwhsD6cqwg@mail.gmail.com. Another
command that takes a conflicting lock on the partition only is
TRUNCATE. Many other commands that normally take an AEL can't be run
on a partition, for example, ALTER TABLE ADD COLUMN.

The same would have deadlocked on master if you'd created the indexes
in the reverse order, but I guess the point is that today there's some
defined order that you can create the indexes in that won't cause a
deadlock, but with [1]/messages/by-id/CAKJS1f-vYAHqqaU878Q4cdZYHwPcv1J_C-mG=Cs2UwhsD6cqwg@mail.gmail.com there is no defined order since the tables are
locked in order that tuples are routed to them.

Robert pointed out something interesting in [2]/messages/by-id/CA+TgmoZGJsy-nRFnzurhZQJtHdDh5fzJKvbvhS0byN6_46pB9Q@mail.gmail.com about UPDATEs of a
partition key perhaps routing a tuple to a partition that's been
excluded by constraint exclusion, so the lock is not taken initially,
but would be taken later in ExecSetupPartitionTupleRouting(). I ran
through that scenario today and discovered it can't happen as excluded
partitions are still in the rangetable, so are still locked either in
the planner or by AcquireExecutorLocks() and the order those are
locked in is well defined from the planner. However, I believe that's
something Amit plans to change in [3]https://commitfest.postgresql.org/20/1778/, where he proposes to only lock
partitions that survive partition pruning (among many other things).

There are probably a few things we could do here:

a. Have all DDL on partitions obtain the same lock level on all
ancestor partitioned tables taking a lock on the root first and
working down to the leaf.
b. Just lock what we touch and increase the likelihood of deadlocks occurring.
c. Reject [1]/messages/by-id/CAKJS1f-vYAHqqaU878Q4cdZYHwPcv1J_C-mG=Cs2UwhsD6cqwg@mail.gmail.com and [3]https://commitfest.postgresql.org/20/1778/ because we don't want to increase the chances of
deadlocks.

I'm highly against doing 'c' since both [1]/messages/by-id/CAKJS1f-vYAHqqaU878Q4cdZYHwPcv1J_C-mG=Cs2UwhsD6cqwg@mail.gmail.com and [3]https://commitfest.postgresql.org/20/1778/ are very useful
patches which scale partitioning to work well with very large numbers
of partitions. At the moment we just can bearly do half a dozen
partitions before the performance starts going south. 'a' is not that
ideal a solution either as it means that anyone doing CREATE INDEX or
TRUNCATE on a partition would conflict with queries that are querying
an unrelated part of the partitioned table. While I don't really like
'b' either, I wonder how many people are going to hit deadlocks here.
To hit that, I believe that you'd have to be doing the TRUNCATE or
CREATE INDEX inside a transaction and to hit it where you couldn't hit
it before you'd have had to carefully written your CREATE INDEX /
TRUNCATE statements to ensure they're executed in partition order. I'm
unsure how likely that is, but I certainly can't rule out that doing
'b' won't upset anyone.

Perhaps a GUC is needed to choose between 'a' and 'b'?

I'm adding Amit to this email because [3]https://commitfest.postgresql.org/20/1778/ is his and Robert because
he's recently been talking about partition locking too.

[1]: /messages/by-id/CAKJS1f-vYAHqqaU878Q4cdZYHwPcv1J_C-mG=Cs2UwhsD6cqwg@mail.gmail.com
[2]: /messages/by-id/CA+TgmoZGJsy-nRFnzurhZQJtHdDh5fzJKvbvhS0byN6_46pB9Q@mail.gmail.com
[3]: https://commitfest.postgresql.org/20/1778/

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#4Amit Langote
Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: David Rowley (#3)
Re: Performance improvements of INSERTs to a partitioned table

On 2018/11/09 11:19, David Rowley wrote:

On 7 November 2018 at 21:31, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:

AFAIK, When CREATE INDEX on a partition and INSERT to a parent table are
executed at the same time, this patch causes deadlock.

* partitions information

Partition key: RANGE (a)
Partitions: a_1 FOR VALUES FROM (1) TO (100),
a_2 FOR VALUES FROM (100) TO (200)

T1: create index a_1_ix on a_1(a);
T2: insert into a values(101),(1); locking a_2 and waiting releasing a_1’s
lock
T1: create index a_2_ix on a_2(a); ERROR: deadlock detected

I think this situation does not mean unsafe because similar situation will
occurs at no partitioned tables and DBMS could not prevent this situation.

But, I'm not sure this behavior is correct.

That's one case that will deadlock with the 0002 patch in [1]. Another
command that takes a conflicting lock on the partition only is
TRUNCATE. Many other commands that normally take an AEL can't be run
on a partition, for example, ALTER TABLE ADD COLUMN.

The same would have deadlocked on master if you'd created the indexes
in the reverse order, but I guess the point is that today there's some
defined order that you can create the indexes in that won't cause a
deadlock, but with [1] there is no defined order since the tables are
locked in order that tuples are routed to them.

We cannot control the order in which a user performs operations directly
on partitions. What we do have to worry about is the order in which
partitions are locked by a particular piece of backend code when a certain
operation is applied to the parent table. However, if the user had
applied the operation to the parent table instead, then it would've
blocked for any concurrent queries that already had a lock on the parent,
or vice versa, because obviously the table named in the command is locked
before any of its children.

Robert pointed out something interesting in [2] about UPDATEs of a
partition key perhaps routing a tuple to a partition that's been
excluded by constraint exclusion, so the lock is not taken initially> but would be taken later in ExecSetupPartitionTupleRouting(). I ran
through that scenario today and discovered it can't happen as excluded
partitions are still in the rangetable, so are still locked either in
the planner or by AcquireExecutorLocks() and the order those are
locked in is well defined from the planner.

Yes, expand_inherited_rtentry() would lock *all* partitions even if some
of them might not end up in the final plan due to some being excluded.

However, I believe that's
something Amit plans to change in [3], where he proposes to only lock
partitions that survive partition pruning (among many other things).

With my patch, while the planner would take the locks in deterministic
order on the partitions that it adds to the plan, the executor time
locking order is non-deterministic considering that tuple routing may
encounter multiple, not yet seen, partitions.

There are probably a few things we could do here:

a. Have all DDL on partitions obtain the same lock level on all
ancestor partitioned tables taking a lock on the root first and
working down to the leaf.
b. Just lock what we touch and increase the likelihood of deadlocks occurring.
c. Reject [1] and [3] because we don't want to increase the chances of
deadlocks.

I'm highly against doing 'c' since both [1] and [3] are very useful
patches which scale partitioning to work well with very large numbers
of partitions. At the moment we just can bearly do half a dozen
partitions before the performance starts going south.

Agreed.

'a' is not that
ideal a solution either as it means that anyone doing CREATE INDEX or
TRUNCATE on a partition would conflict with queries that are querying
an unrelated part of the partitioned table.

As you pointed out, many of the commands that require locks that would
conflict with concurrent queries are not allowed to run directly on
partitions anyway. For operations that *are* allowed, doing 'a' is same
as asking the user to perform the operation on the root parent instead, at
least as far as the concurrency is concerned (user may not want to
*truncate* all partitions, only some.)

While I don't really like
'b' either, I wonder how many people are going to hit deadlocks here.
To hit that, I believe that you'd have to be doing the TRUNCATE or
CREATE INDEX inside a transaction and to hit it where you couldn't hit
it before you'd have had to carefully written your CREATE INDEX /
TRUNCATE statements to ensure they're executed in partition order. I'm
unsure how likely that is, but I certainly can't rule out that doing
'b' won't upset anyone.

As long as queries involve tuple routing that touches multiple not yet
seen partitions, someone doing conflicting operations directly on multiple
partitions in a transaction will have to be ready to see deadlocks.
Maybe, we can document that.

Perhaps a GUC is needed to choose between 'a' and 'b'?

I guess we'll have a hard time explaining such configuration option to
users though.

Thanks,
Amit

#5Kato, Sho
Kato, Sho
kato-sho@jp.fujitsu.com
In reply to: David Rowley (#3)
RE: Performance improvements of INSERTs to a partitioned table

Thanks David.

I'm highly against doing 'c' since both [1] and [3] are very useful patches which scale partitioning to work well with very large numbers of partitions.

Agreed.

'a' is not that ideal a solution either as it means that anyone doing CREATE INDEX or TRUNCATE on a partition would conflict with queries that are querying an unrelated part of the partitioned table.

If only deadlock is a problem, I think intention lock is needed.
All SQL on a partitioned tables take a intention exclusive or shared lock on the root first and take a conventional lock on target leaf next.

While I don't really like 'b' either, I wonder how many people are going to hit deadlocks here.
To hit that, I believe that you'd have to be doing the TRUNCATE or CREATE INDEX inside a transaction and to hit it where you couldn't hit it before you'd have had to carefully written your CREATE INDEX / TRUNCATE statements to ensure they're executed in partition order.

In case of CREATE INDEX on the leaf, users have to ensure to statements executed in partition order since similar case could occur in no partitioned table.
But, DBMS have to prevent deadlock occurred by CREATE INDEX/TRUNCATE on the root.

Perhaps a GUC is needed to choose between 'a' and 'b'?

I agree with Amit. It is difficult to choose for users.

regards,
Sho Kato

Show quoted text

-----Original Message-----
From: David Rowley [mailto:david.rowley@2ndquadrant.com]
Sent: Friday, November 9, 2018 11:20 AM
To: Kato, Sho/加藤 翔 <kato-sho@jp.fujitsu.com>; Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp>; Robert Haas <robertmhaas@gmail.com>
Cc: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: Performance improvements of INSERTs to a partitioned table

On 7 November 2018 at 21:31, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:

AFAIK, When CREATE INDEX on a partition and INSERT to a parent table
are executed at the same time, this patch causes deadlock.

* partitions information

Partition key: RANGE (a)
Partitions: a_1 FOR VALUES FROM (1) TO (100),
a_2 FOR VALUES FROM (100) TO (200)

T1: create index a_1_ix on a_1(a);
T2: insert into a values(101),(1); locking a_2 and waiting releasing
a_1’s lock
T1: create index a_2_ix on a_2(a); ERROR: deadlock detected

I think this situation does not mean unsafe because similar situation
will occurs at no partitioned tables and DBMS could not prevent this

situation.

But, I'm not sure this behavior is correct.

That's one case that will deadlock with the 0002 patch in [1]. Another
command that takes a conflicting lock on the partition only is TRUNCATE.
Many other commands that normally take an AEL can't be run on a partition,
for example, ALTER TABLE ADD COLUMN.

The same would have deadlocked on master if you'd created the indexes
in the reverse order, but I guess the point is that today there's some
defined order that you can create the indexes in that won't cause a
deadlock, but with [1] there is no defined order since the tables are
locked in order that tuples are routed to them.

Robert pointed out something interesting in [2] about UPDATEs of a
partition key perhaps routing a tuple to a partition that's been excluded
by constraint exclusion, so the lock is not taken initially, but would
be taken later in ExecSetupPartitionTupleRouting(). I ran through that
scenario today and discovered it can't happen as excluded partitions are
still in the rangetable, so are still locked either in the planner or
by AcquireExecutorLocks() and the order those are locked in is well
defined from the planner. However, I believe that's something Amit plans
to change in [3], where he proposes to only lock partitions that survive
partition pruning (among many other things).

There are probably a few things we could do here:

a. Have all DDL on partitions obtain the same lock level on all ancestor
partitioned tables taking a lock on the root first and working down to
the leaf.
b. Just lock what we touch and increase the likelihood of deadlocks
occurring.
c. Reject [1] and [3] because we don't want to increase the chances of
deadlocks.

I'm highly against doing 'c' since both [1] and [3] are very useful patches
which scale partitioning to work well with very large numbers of
partitions. At the moment we just can bearly do half a dozen partitions
before the performance starts going south. 'a' is not that ideal a solution
either as it means that anyone doing CREATE INDEX or TRUNCATE on a
partition would conflict with queries that are querying an unrelated part
of the partitioned table. While I don't really like 'b' either, I wonder
how many people are going to hit deadlocks here.
To hit that, I believe that you'd have to be doing the TRUNCATE or CREATE
INDEX inside a transaction and to hit it where you couldn't hit it before
you'd have had to carefully written your CREATE INDEX / TRUNCATE
statements to ensure they're executed in partition order. I'm unsure how
likely that is, but I certainly can't rule out that doing 'b' won't upset
anyone.

Perhaps a GUC is needed to choose between 'a' and 'b'?

I'm adding Amit to this email because [3] is his and Robert because he's
recently been talking about partition locking too.

[1]
/messages/by-id/CAKJS1f-vYAHqqaU878Q4cdZYHwPcv
1J_C-mG%3DCs2UwhsD6cqwg%40mail.gmail.com
[2]
/messages/by-id/CA+TgmoZGJsy-nRFnzurhZQJtHdD
h5fzJKvbvhS0byN6_46pB9Q%40mail.gmail.com
[3] https://commitfest.postgresql.org/20/1778/

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#6David Rowley
David Rowley
david.rowley@2ndquadrant.com
In reply to: Amit Langote (#4)
Re: Performance improvements of INSERTs to a partitioned table

On 9 November 2018 at 18:45, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

As long as queries involve tuple routing that touches multiple not yet
seen partitions, someone doing conflicting operations directly on multiple
partitions in a transaction will have to be ready to see deadlocks.
Maybe, we can document that.

Perhaps it's good enough. A guess there's at least a workaround of
doing LOCK TABLE <root_partitioned_table> in the CREATE INDEX /
TRUNCATE session.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services