speeding up planning with partitions
It is more or less well known that the planner doesn't perform well with
more than a few hundred partitions, even when only a handful of partitions
are ultimately included in the plan. The situation improved a bit in PG
11, where we replaced the older method of pruning partitions one by one
using constraint exclusion with a much faster method that finds the
relevant partitions using partitioning metadata. However, we could only
use it for SELECT queries, because UPDATE/DELETE are handled by a
completely different code path, whose structure doesn't allow it to call
the new pruning module's functionality. Actually, not being able to use
the new pruning is not the only problem for UPDATE/DELETE; more on that
further below.
While the situation improved with the new pruning where it could be used,
there are still overheads in the way the planner handles partitions. As
things stand today, it will spend cycles and allocate memory for
partitions even before pruning is performed, meaning most of that effort
could be for partitions that were better left untouched. Currently, in an
early phase of planning, the planner will lock and heap_open *all*
partitions, create range table entries and AppendRelInfos for them, and
finally initialize RelOptInfos for them, even touching the disk file of
each partition in the process. All of that processing is in vain for
partitions that are pruned, because they won't be included in the final
plan. This problem grows worse as the number of partitions grows into the
thousands, because the range table grows too big.
That could be fixed by delaying all that per-partition activity to a
point where pruning has already been performed, so that we know which
partitions to open and create planning data structures for, such as
somewhere downstream of query_planner. But before we can do that, we must
do something about the fact that UPDATE/DELETE won't be able to cope with
it, because the code path that currently handles the planning of
UPDATE/DELETE on partitioned tables (inheritance_planner, called from
subquery_planner) relies on AppendRelInfos for all partitions having been
initialized by an earlier planning phase. Delaying that work to
query_planner would be too late, because inheritance_planner calls
query_planner for each partition, not for the parent. That is, if
query_planner, which is downstream of inheritance_planner, were in charge
of determining which partitions to open, the latter wouldn't know which
partitions to call the former for. :)
That would be fixed if this ordering dependency no longer existed, which
is what I propose to do with the attached patch 0001. I've tried to
describe how the patch manages that in its commit message, but I'll
summarize here. As things stand today, inheritance_planner modifies the
query for each leaf partition to make the partition act as the query's
result relation instead of the original partitioned table, and calls
grouping_planner on the modified query. That means anything that's joined
to the partitioned table looks to instead be joined to the partition, and
join paths are generated accordingly. Also, the resulting path's
targetlist is adjusted to be suitable for the result partition. Upon
studying how this works, I concluded that the same result can be achieved
if we call grouping_planner only once and repeat, for each partition, only
the portions of query_planner's and grouping_planner's processing that
generate the join paths and the appropriate targetlist, respectively.
That way, we can rely on query_planner to determine the result partitions
for us, which in turn relies on the faster partprune.c-based method of
pruning. That speeds things up in two ways: pruning itself is faster, and
we no longer repeat the common processing for each partition.
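To make the shape of the change concrete, here is a toy sketch in Python; the function names and unit costs are invented for illustration and are not actual PostgreSQL planner code. Today the common planning work is repeated per result partition, whereas with 0001 it is done once:

```python
# Toy model of the two planning shapes described above; names and unit
# costs are made up for illustration, not PostgreSQL functions.

COMMON_COST = 10   # stand-in for the shared query_planner/grouping_planner work
PER_PART_COST = 1  # stand-in for per-partition join-path/targetlist work

def plan_update_today(result_partitions):
    """inheritance_planner today: a full planning pass per result partition."""
    return sum(COMMON_COST + PER_PART_COST for _ in result_partitions)

def plan_update_proposed(result_partitions):
    """With 0001: common work done once, then only the join-path and
    targetlist steps are repeated for each (unpruned) result partition."""
    return COMMON_COST + sum(PER_PART_COST for _ in result_partitions)

print(plan_update_today(["ht_3"]), plan_update_proposed(["ht_3"]))    # 11 11
eight = ["ht_%d" % i for i in range(8)]
print(plan_update_today(eight), plan_update_proposed(eight))          # 88 18
```

The gap between the two grows linearly with the number of unpruned result partitions, which matches the shape of the benchmark numbers below.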
With 0001 in place, nothing requires that partitions be opened in an
earlier planning phase, so I propose patch 0002, which refactors the
opening and creation of planner data structures for partitions such that
it is performed only after pruning. However, it doesn't do anything about
the fact that all partitions are still locked in the earlier phase.
With various overheads gone thanks to 0001 and 0002, locking of all
partitions via find_all_inheritors can be seen as the single largest
remaining bottleneck, which 0003 tries to address. I've kept it a
separate patch, because I'll need to think a bit more before I can say
it's actually safe to defer locking to late planning, mainly due to the
concern about the change in locking order from the current method. I'm
attaching it here because I also want to show the performance improvement
we can expect with it.
I measured the gain in performance due to each patch on a modest virtual
machine. Details of the measurement and results follow.
* Benchmark scripts
update.sql
update ht set a = 0 where b = 1;
select.sql
select * from ht where b = 1;
* Table:
create table ht (a int, b int) partition by hash (b);
create table ht_1 partition of ht for values with (modulus N, remainder 0);
...
create table ht_N partition of ht for values with (modulus N, remainder N-1);
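For convenience, the DDL above can be generated mechanically for any partition count N; a small helper script (hypothetical, not part of the patches):

```python
def hash_partition_ddl(nparts):
    """Generate the benchmark's hash-partitioned table DDL for nparts partitions."""
    stmts = ["create table ht (a int, b int) partition by hash (b);"]
    for i in range(nparts):
        stmts.append(
            f"create table ht_{i + 1} partition of ht"
            f" for values with (modulus {nparts}, remainder {i});"
        )
    return "\n".join(stmts)

# e.g. pipe the output into psql to set up a run with 8 partitions
print(hash_partition_ddl(8))
```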
* Rounded tps with update.sql and select.sql against a regular table
(nparts = 0) and a partitioned table with various partition counts:
pgbench -n -T 60 -f update.sql
nparts master 0001 0002 0003
====== ====== ==== ==== ====
0 2856 2893 2862 2816
8 507 1115 1447 1872
16 260 765 1173 1892
32 119 483 922 1884
64 59 282 615 1881
128 29 153 378 1835
256 14 79 210 1803
512 5 40 113 1728
1024 2 17 57 1616
2048 0* 9 30 1471
4096 0+ 4 15 1236
8192 0= 2 7 975
* 0.46
+ 0.0064
= 0 (OOM on a virtual machine with 4GB RAM)
As can be seen here, 0001 is a big help for update queries.
pgbench -n -T 60 -f select.sql
For a select query that doesn't contain a join and needs to scan only one
partition:
nparts master 0001 0002 0003
====== ====== ==== ==== ====
0 2290 2329 2319 2268
8 1058 1077 1414 1788
16 711 729 1124 1789
32 450 475 879 1773
64 265 272 603 1765
128 146 149 371 1685
256 76 77 214 1678
512 39 39 112 1636
1024 16 17 59 1525
2048 8 9 29 1416
4096 4 4 15 1195
8192 2 2 7 932
Actually, here we get almost the same numbers with 0001 as with master,
because 0001 changes nothing for SELECT queries. We start seeing an
improvement with 0002, the patch that delays opening partitions.
Thanks,
Amit
Attachments:
* 0001-Overhaul-partitioned-table-update-delete-planning.patch (+416 -141)
* 0002-Lazy-creation-of-partition-objects-for-planning.patch (+486 -325)
* 0003-Only-lock-partitions-that-will-be-scanned-by-a-query.patch (+13 -13)
On 30 August 2018 at 00:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
nparts master 0001 0002 0003
====== ====== ==== ==== ====
0 2856 2893 2862 2816
8 507 1115 1447 1872
16 260 765 1173 1892
32 119 483 922 1884
64 59 282 615 1881
128 29 153 378 1835
256 14 79 210 1803
512 5 40 113 1728
1024 2 17 57 1616
2048 0* 9 30 1471
4096 0+ 4 15 1236
8192 0= 2 7 975
Those look promising. Are you going to submit these patches to the
September 'fest?
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp]
I measured the gain in performance due to each patch on a modest virtual
machine. Details of the measurement and results follow.
Amazing!
UPDATE
nparts master 0001 0002 0003
====== ====== ==== ==== ====
0 2856 2893 2862 2816
8 507 1115 1447 1872
SELECT
nparts master 0001 0002 0003
====== ====== ==== ==== ====
0 2290 2329 2319 2268
8 1058 1077 1414 1788
Even a small number of partitions still introduces a not-small overhead
(UPDATE: 34%, SELECT: 22%). Do you think this overhead can be reduced
further? What part do you guess would be relevant? This one?
that it is now performed after pruning. However, it doesn't do anything
about the fact that partitions are all still locked in the earlier phase.
Regards
Takayuki Tsunakawa
On 2018/08/30 7:27, David Rowley wrote:
On 30 August 2018 at 00:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
nparts master 0001 0002 0003
====== ====== ==== ==== ====
0 2856 2893 2862 2816
8 507 1115 1447 1872
16 260 765 1173 1892
32 119 483 922 1884
64 59 282 615 1881
128 29 153 378 1835
256 14 79 210 1803
512 5 40 113 1728
1024 2 17 57 1616
2048 0* 9 30 1471
4096 0+ 4 15 1236
8192 0= 2 7 975
Those look promising. Are you going to submit these patches to the
September 'fest?
Thanks, I just did.
https://commitfest.postgresql.org/19/1778/
Regards,
Amit
On 2018/08/30 10:09, Tsunakawa, Takayuki wrote:
From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp]
I measured the gain in performance due to each patch on a modest virtual
machine. Details of the measurement and results follow.
Amazing!
Thanks.
UPDATE
nparts master 0001 0002 0003
====== ====== ==== ==== ====
0 2856 2893 2862 2816
8 507 1115 1447 1872
SELECT
nparts master 0001 0002 0003
====== ====== ==== ==== ====
0 2290 2329 2319 2268
8 1058 1077 1414 1788
Even a small number of partitions still introduces a not-small overhead (UPDATE:34%, SELECT:22%).
Yeah, that's true.
Do you think this overhead can be reduced further?
We can definitely try, but I'm not immediately sure whether further
improvements will come from continuing to fix the planner. Maybe the
overhead of partitioning can be attributed to other parts of the system.
What part do you guess would be relevant? This one?
that it is now performed after pruning. However, it doesn't do anything
about the fact that partitions are all still locked in the earlier phase.
Actually, I wrote that for patch 0002. The next patch (0003) is meant to
fix that. So, the overhead you're seeing is even after making sure that
only the selected partitions are locked.
Thanks,
Amit
From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp]
We can definitely try, but I'm not immediately sure if the further
improvements will come from continuing to fix the planner. Maybe, the
overhead of partitioning could be attributed to other parts of the system.
Actually, I wrote that for patch 0002. The next patch (0003) is meant to
fix that. So, the overhead you're seeing is even after making sure that
only the selected partitions are locked.
Thanks for sharing your thoughts. I understand that we should first find the bottleneck with profiling.
Regards
Takayuki Tsunakawa
On 2018/08/29 21:06, Amit Langote wrote:
I measured the gain in performance due to each patch on a modest virtual
machine. Details of the measurement and results follow.
UPDATE:
nparts master 0001 0002 0003
====== ====== ==== ==== ====
0 2856 2893 2862 2816
8 507 1115 1447 1872
16 260 765 1173 1892
32 119 483 922 1884
64 59 282 615 1881
128 29 153 378 1835
256 14 79 210 1803
512 5 40 113 1728
1024 2 17 57 1616
2048 0* 9 30 1471
4096 0+ 4 15 1236
8192 0= 2 7 975
For SELECT:
nparts master 0001 0002 0003
====== ====== ==== ==== ====
0 2290 2329 2319 2268
8 1058 1077 1414 1788
16 711 729 1124 1789
32 450 475 879 1773
64 265 272 603 1765
128 146 149 371 1685
256 76 77 214 1678
512 39 39 112 1636
1024 16 17 59 1525
2048 8 9 29 1416
4096 4 4 15 1195
8192 2 2 7 932
Prompted by Tsunakawa-san's comment, I looked at the profiles while
running the benchmark with partitioning and noticed a few things that make
clear why, even with 0003 applied, tps numbers decreased as the number of
partitions increased. Some functions that appeared high up in the
profiles are related to partitioning:
* set_relation_partition_info calling partition_bounds_copy(), which calls
datumCopy() on N Datums, where N is the number of partitions. The more
partitions there are, the higher up it appears in the profiles. I suspect
that this copying might be redundant; the planner could keep using the
same pointer as the relcache.
There are a few existing and newly introduced sites in the planner where
the code iterates over *all* partitions of a table, where processing just
the partitions selected for scanning would suffice. I observed the
following functions in the profiles:
* make_partitionedrel_pruneinfo, which goes over all partitions to
generate the subplan_map and subpart_map arrays to put into the
PartitionedRelPruneInfo data structure it is in charge of generating
* apply_scanjoin_target_to_paths, which goes over all partitions to adjust
their Paths to apply the required scanjoin target, although most of those
are dummy ones that don't need the adjustment
* For UPDATE, a couple of functions I introduced in patch 0001 were doing
the same thing as apply_scanjoin_target_to_paths, which is unnecessary
To fix the above three instances of redundant processing, I added a
Bitmapset 'live_parts' to RelOptInfo, which stores the set of indexes
(into the RelOptInfo.part_rels array) of only the unpruned partitions, and
replaced the for (i = 0; i < rel->nparts; i++) loops at those sites with
loops that iterate over the members of 'live_parts'.
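The pattern can be pictured with a toy sketch in Python; bms_members below mimics iterating a Bitmapset with bms_next_member(), and the data is invented for illustration:

```python
def bms_members(bitmapset):
    """Yield indexes of the set bits of an integer bitmask, analogous to
    iterating a Bitmapset with bms_next_member()."""
    i = 0
    while bitmapset:
        if bitmapset & 1:
            yield i
        bitmapset >>= 1
        i += 1

# part_rels has one slot per partition; pruned slots hold no real rel.
part_rels = [None, "rel1", None, None, "rel4", None, None, None]

# Old pattern: the for (i = 0; i < nparts; i++) loop visits every slot,
# skipping pruned (dummy) ones one at a time -- O(nparts).
visited_all = [r for r in part_rels if r is not None]

# New pattern: 'live_parts' records the unpruned indexes up front, so
# only those slots are touched -- O(number of live partitions).
live_parts = 0b10010  # bits 1 and 4 set
visited_live = [part_rels[i] for i in bms_members(live_parts)]

assert visited_all == visited_live == ["rel1", "rel4"]
```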
The results looked promising indeed, especially after applying 0003,
which gets rid of locking all partitions.
UPDATE:
nparts master 0001 0002 0003
====== ====== ==== ==== ====
0 2856 2893 2862 2816
8 507 1115 1466 1845
16 260 765 1161 1876
32 119 483 910 1862
64 59 282 609 1895
128 29 153 376 1884
256 14 79 212 1874
512 5 40 115 1859
1024 2 17 58 1847
2048 0 9 29 1883
4096 0 4 15 1867
8192 0 2 7 1826
SELECT:
nparts master 0001 0002 0003
====== ====== ==== ==== ====
0 2290 2329 2319 2268
8 1058 1077 1431 1800
16 711 729 1158 1781
32 450 475 908 1777
64 265 272 612 1791
128 146 149 379 1777
256 76 77 213 1785
512 39 39 114 1776
1024 16 17 59 1756
2048 8 9 30 1746
4096 4 4 15 1722
8192 2 2 7 1706
Note that with 0003, tps no longer degrades as the number of partitions increases.
Attached updated patches, with 0002 containing the changes mentioned above.
Thanks,
Amit
Attachments:
* v2-0001-Overhaul-partitioned-table-update-delete-planning.patch (+416 -141)
* v2-0002-Lazy-creation-of-partition-objects-for-planning.patch (+507 -346)
* v2-0003-Only-lock-partitions-that-will-be-scanned-by-a-qu.patch (+13 -13)
Hi, Amit
Great!
With the total number of records being 6400, I benchmarked while increasing the number of partitions from 100 to 6400.
Applying all three patches, performance improved by 20% for 100 partitions.
I use the same number of records for each partition; do you do the same?
Also, in my case, performance was better when not using prepared statements.
I think these patches do not improve the execute case, so we need the faster runtime pruning patch [1], right?
Details of measurement conditions and results are as follows.
- base source
master (@777e6ddf17) + Speeding up Insert v8 patch [1]
- table definition (e.g. 100 partitions)
create table test.accounts(aid serial, abalance int) partition by range(aid);
create table test.accounts_history(id serial, aid int, delta int, mtime timestamp without time zone) partition by range(aid);
create table test.account_part_1 partition of test.accounts for values from (1) to (65);
create table test.account_part_2 partition of test.accounts for values from (65) to (129);
...
create table test.account_part_100 partition of test.accounts for values from (6337) to (6400);
create table test.ah_part_1 partition of test.accounts_history for values from (1) to (65);
create table test.ah_part_2 partition of test.accounts_history for values from (65) to (129);
...
create table test.ah_part_100 partition of test.accounts_history for values from (6337) to (6400);
- benchmark script
\set aid random(1, 6400)
\set delta random(-5000, 5000)
BEGIN;
UPDATE test.accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM test.accounts WHERE aid = :aid;
INSERT INTO test.accounts_history (aid, delta, mtime) VALUES (:aid, :delta, CURRENT_TIMESTAMP);
END;
- command option
pgbench -d testdb -f benchmark.pgbench -T 180 -r -n -M prepare
pgbench -d testdb -f benchmark.pgbench -T 180 -r -n
- results
base source no prepared
part_num | tps_ex | update_latency | select_latency | insert_latency
----------+------------+----------------+----------------+----------------
100 | 662.414805 | 0.357 | 0.265 | 0.421
200 | 494.478431 | 0.439 | 0.349 | 0.579
400 | 307.982089 | 0.651 | 0.558 | 0.927
800 | 191.360676 | 0.979 | 0.876 | 1.548
1600 | 75.344947 | 2.253 | 2.003 | 4.301
3200 | 30.643902 | 5.716 | 4.955 | 10.118
6400 | 16.726056 | 12.512 | 8.582 | 18.054
0001 no prepared
part_num | tps_ex | update_latency | select_latency | insert_latency
----------+------------+----------------+----------------+----------------
100 | 429.816329 | 0.811 | 0.75 | 0.365
200 | 275.211531 | 1.333 | 1.248 | 0.501
400 | 160.499833 | 2.384 | 2.252 | 0.754
800 | 79.387776 | 4.935 | 4.698 | 1.468
1600 | 24.787377 | 16.593 | 15.954 | 4.302
3200 | 9.846421 | 42.96 | 42.139 | 8.848
6400 | 4.919772 | 87.43 | 83.109 | 16.56
0001 prepared
part_num | tps_ex | update_latency | select_latency | insert_latency
----------+------------+----------------+----------------+----------------
100 | 245.100776 | 2.728 | 0.374 | 0.476
200 | 140.249283 | 5.116 | 0.603 | 0.686
400 | 67.608559 | 11.383 | 1.055 | 1.179
800 | 23.085806 | 35.781 | 2.585 | 2.677
1600 | 6.211247 | 141.093 | 7.096 | 6.785
3200 | 1.808214 | 508.045 | 15.741 | 13.243
6400 | 0.495635 | 1919.362 | 37.691 | 28.177
0001 + 0002 no prepared
part_num | tps_ex | update_latency | select_latency | insert_latency
----------+------------+----------------+----------------+----------------
100 | 682.53091 | 0.388 | 0.35 | 0.35
200 | 469.906601 | 0.543 | 0.496 | 0.51
400 | 321.915349 | 0.78 | 0.721 | 0.752
800 | 201.620975 | 1.246 | 1.156 | 1.236
1600 | 94.438204 | 2.612 | 2.335 | 2.745
3200 | 38.292922 | 6.657 | 5.579 | 6.808
6400 | 21.48462 | 11.989 | 10.104 | 12.601
0001 + 0002 prepared
part_num | tps_ex | update_latency | select_latency | insert_latency
----------+------------+----------------+----------------+----------------
100 | 591.10863 | 0.433 | 0.342 | 0.422
200 | 393.223638 | 0.625 | 0.522 | 0.614
400 | 253.672736 | 0.946 | 0.828 | 0.928
800 | 146.840262 | 1.615 | 1.448 | 1.604
1600 | 52.805593 | 4.656 | 3.811 | 4.473
3200 | 21.461606 | 11.48 | 9.56 | 10.661
6400 | 11.888232 | 22.869 | 16.841 | 18.871
0001 + 0002 + 0003 no prepared
part_num | tps_ex | update_latency | select_latency | insert_latency
----------+------------+----------------+----------------+----------------
100 | 798.962029 | 0.304 | 0.267 | 0.339
200 | 577.893396 | 0.384 | 0.346 | 0.487
400 | 426.542177 | 0.472 | 0.435 | 0.717
800 | 288.616213 | 0.63 | 0.591 | 1.162
1600 | 154.247034 | 1.056 | 0.987 | 2.384
3200 | 59.711446 | 2.416 | 2.233 | 6.514
6400 | 37.109761 | 3.387 | 3.099 | 11.762
0001 + 0002 + 0003 prepared
part_num | tps_ex | update_latency | select_latency | insert_latency
----------+------------+----------------+----------------+----------------
100 | 662.414805 | 0.357 | 0.265 | 0.421
200 | 494.478431 | 0.439 | 0.349 | 0.579
400 | 307.982089 | 0.651 | 0.558 | 0.927
800 | 191.360676 | 0.979 | 0.876 | 1.548
1600 | 75.344947 | 2.253 | 2.003 | 4.301
3200 | 30.643902 | 5.716 | 4.955 | 10.118
6400 | 16.726056 | 12.512 | 8.582 | 18.054
Although it may not be related to this, when measured with pg11 beta2, somehow the performance was better.
11beta2 + v1-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch [3] prepared
part_num | tps_ex | update_latency | select_latency | insert_latency
----------+-------------+----------------+----------------+----------------
100 | 756.07228 | 0.942 | 0.091 | 0.123
[1]: /messages/by-id/CAKJS1f_QN-nmf6jCQ4gRU_8ab0zrd0ms-U=_Dj0KUARJiuGpOA@mail.gmail.com
[2]: /messages/by-id/CAKJS1f9T_32Xpb-p8cWwo5ezSfVhXviUW8QTWncP8ksPHDRK8g@mail.gmail.com
[3]: /messages/by-id/CAKJS1f_1RJyFquuCKRFHTdcXqoPX-PYqAd7nz=GVBwvGh4a6xA@mail.gmail.com
regards,
Sho Kato
-----Original Message-----
From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp]
Sent: Wednesday, August 29, 2018 9:06 PM
To: Pg Hackers <pgsql-hackers@postgresql.org>
Subject: speeding up planning with partitions
Thank you Kato-san for testing.
On 2018/08/31 19:48, Kato, Sho wrote:
Hi, Amit
Great!
With the total number of records being 6400, I benchmarked while increasing the number of partitions from 100 to 6400.
Applying all three patches, performance improved by 20% for 100 partitions.
I use the same number of records for each partition; do you do the same?
I didn't load any data into the tables when running the tests, because
these patches are meant to reduce planner latency. More specifically,
they address the current planner behavior whereby its latency increases
with an increasing number of partitions, with focus on the common case
where only a single partition needs to be scanned by a given query.
I'd try to avoid using a benchmark whose results are affected by anything
other than planning latency. It would be a good idea to leave the tables
empty and vary just the number of partitions and nothing else.
Also, we're interested in planning latency, so using just SELECT and
UPDATE in your benchmark script will be enough, because their planning
time is affected by the number of partitions. There is no need to measure
INSERT latency, because its planning time is not affected by the number of
partitions. Moreover, I'd suggest taking the INSERT statement out of the
benchmark for now, because its execution time does vary unfavorably with
an increasing number of partitions. Sure, there are other patches to
address that, but it's better not to mix the patches, to avoid confusion.
Also, in my case, performance was better when not using prepared statements.
The patches in this thread do nothing for execution, so there is no need
to compare prepared vs. non-prepared. In fact, just measure non-prepared
tps and latency, because we're only interested in planning time here.
I think these patches do not improve the execute case, so we need the faster runtime pruning patch [1], right?
We already have the run-time pruning code (that is, the code in the patch
you linked) committed into the tree in PG 11, so the master branch has it
too. But since we're not interested in execution time, there's no need to
worry about run-time pruning.
Details of measurement conditions and results are as follows.
- base source
master (@777e6ddf17) + Speeding up Insert v8 patch [1]
- table definition (e.g. 100 partitions)
create table test.accounts(aid serial, abalance int)
partition by range(aid);
create table test.accounts_history(id serial, aid int, delta int,
mtime timestamp without time zone)
partition by range(aid);
- command option
pgbench -d testdb -f benchmark.pgbench -T 180 -r -n -M prepare
pgbench -d testdb -f benchmark.pgbench -T 180 -r -n
- results
base source no prepared
part_num | tps_ex | update_latency | select_latency | insert_latency
----------+------------+----------------+----------------+----------------
100 | 662.414805 | 0.357 | 0.265 | 0.421
200 | 494.478431 | 0.439 | 0.349 | 0.579
400 | 307.982089 | 0.651 | 0.558 | 0.927
800 | 191.360676 | 0.979 | 0.876 | 1.548
1600 | 75.344947 | 2.253 | 2.003 | 4.301
3200 | 30.643902 | 5.716 | 4.955 | 10.118
6400 | 16.726056 | 12.512 | 8.582 | 18.054
0001 no prepared
part_num | tps_ex | update_latency | select_latency | insert_latency
----------+------------+----------------+----------------+----------------
100 | 429.816329 | 0.811 | 0.75 | 0.365
200 | 275.211531 | 1.333 | 1.248 | 0.501
400 | 160.499833 | 2.384 | 2.252 | 0.754
800 | 79.387776 | 4.935 | 4.698 | 1.468
1600 | 24.787377 | 16.593 | 15.954 | 4.302
3200 | 9.846421 | 42.96 | 42.139 | 8.848
6400 | 4.919772 | 87.43 | 83.109 | 16.56
Hmm, since 0001 is meant to improve update planning time, it seems odd
that you'd get poorer results compared to the base source. But it seems
the base source is actually master + the patch to improve the execution
time of updates, so maybe that patch is playing a part here, although I'm
not sure why even that would make this much of a difference.
I suggest that you use un-patched master as the base source, that is,
leave out any patches meant to improve execution time.
[ ... ]
0001 + 0002 no prepared
part_num | tps_ex | update_latency | select_latency | insert_latency
----------+------------+----------------+----------------+----------------
100 | 682.53091 | 0.388 | 0.35 | 0.35
200 | 469.906601 | 0.543 | 0.496 | 0.51
400 | 321.915349 | 0.78 | 0.721 | 0.752
800 | 201.620975 | 1.246 | 1.156 | 1.236
1600 | 94.438204 | 2.612 | 2.335 | 2.745
3200 | 38.292922 | 6.657 | 5.579 | 6.808
6400 | 21.48462 | 11.989 | 10.104 | 12.601
[ ... ]
0001 + 0002 + 0003 no prepared
part_num | tps_ex | update_latency | select_latency | insert_latency
----------+------------+----------------+----------------+----------------
100 | 798.962029 | 0.304 | 0.267 | 0.339
200 | 577.893396 | 0.384 | 0.346 | 0.487
400 | 426.542177 | 0.472 | 0.435 | 0.717
800 | 288.616213 | 0.63 | 0.591 | 1.162
1600 | 154.247034 | 1.056 | 0.987 | 2.384
3200 | 59.711446 | 2.416 | 2.233 | 6.514
6400 | 37.109761 | 3.387 | 3.099 | 11.762
[ ... ]
By the way, as you may have noticed, I posted version 2 of the patches on
this thread. If you apply them, you will be able to see almost the same
TPS for any number of partitions with master + 0001 + 0002 + 0003.
Although it may not be related to this, when measured with pg11 beta2, somehow the performance was better.
11beta2 + v1-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch[3] prepared
part_num | tps_ex | update_latency | select_latency | insert_latency
----------+-------------+----------------+----------------+----------------
100 | 756.07228 | 0.942 | 0.091 | 0.123
I guess if you had applied the latest version of
"Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch" (which is v8
posted at [1]) on top of master + 0001 + 0002 + 0003, you'd get better
performance too. But, as mentioned above, we're interested in measuring
planning latency, not execution latency, so we should leave out any
patches aimed at improving execution latency to avoid confusion.
Thanks again.
Regards,
Amit
[1]: /messages/by-id/CAKJS1f9T_32Xpb-p8cWwo5ezSfVhXviUW8QTWncP8ksPHDRK8g@mail.gmail.com
On Wed, Aug 29, 2018 at 5:36 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
It is more or less well known that the planner doesn't perform well with
more than a few hundred partitions even when only a handful of partitions
are ultimately included in the plan. Situation has improved a bit in PG
11 where we replaced the older method of pruning partitions one-by-one
using constraint exclusion with a much faster method that finds relevant
partitions by using partitioning metadata. However, we could only use it
for SELECT queries, because UPDATE/DELETE are handled by a completely
different code path, whose structure doesn't allow it to call the new
pruning module's functionality. Actually, not being able to use the new
pruning is not the only problem for UPDATE/DELETE, more on which further
below.
pgbench -n -T 60 -f update.sql
nparts master 0001 0002 0003
====== ====== ==== ==== ====
0 2856 2893 2862 2816
8 507 1115 1447 1872
16 260 765 1173 1892
32 119 483 922 1884
64 59 282 615 1881
128 29 153 378 1835
256 14 79 210 1803
512 5 40 113 1728
1024 2 17 57 1616
2048 0* 9 30 1471
4096 0+ 4 15 1236
8192 0= 2 7 975
* 0.46
+ 0.0064
= 0 (OOM on a virtual machine with 4GB RAM)
The idea looks interesting while going through the patch I observed
this comment.
/*
* inheritance_planner
* Generate Paths in the case where the result relation is an
* inheritance set.
*
* We have to handle this case differently from cases where a source relation
* is an inheritance set. Source inheritance is expanded at the bottom of the
* plan tree (see allpaths.c), but target inheritance has to be expanded at
* the top.
I think with your patch these comments needs to be change?
if (parse->resultRelation &&
- rt_fetch(parse->resultRelation, parse->rtable)->inh)
+ rt_fetch(parse->resultRelation, parse->rtable)->inh &&
+ rt_fetch(parse->resultRelation, parse->rtable)->relkind !=
+ RELKIND_PARTITIONED_TABLE)
inheritance_planner(root);
else
grouping_planner(root, false, tuple_fraction);
I think we can add some comments to explain if the target rel itself
is partitioned rel then why
we can directly go to the grouping planner.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hi, Amit. Thank you for your reply.
I didn't load any data into the tables when running the tests, because these patches are meant for reducing planner latency. More specifically, they're meant to fix the current planner behavior whereby latency increases with the number of partitions, with a focus on the common case where only a single partition needs to be scanned by a given query.
Thank you for telling me details. It is very helpful.
No need to try to measure the INSERT latency, because its planning latency is not affected by the number of partitions. Moreover, I'd rather suggest you take out the INSERT statement from the benchmark for now, because its execution time does vary unfavorably with increasing number of partitions.
Thank you for your advice.
In fact, just measure non-prepared tps and latency, because we're only interested in planning time here.
I got it.
Hmm, since 0001 is meant to improve update planning time, it seems odd that you'd get poorer results compared to base source. But, it seems base source is actually master + the patch to improve the execution time of update, so maybe that patch is playing a part here, although I'm not sure why even that's making this much of a difference.
I suggest that you use un-patched master as base source, that is, leave out any patches to improve execution time.
By the way, as you may have noticed, I posted a version 2 of the patches on this thread. If you apply them, you will be be able to see almost same TPS for any number of partitions with master + 0001 + 0002 + 0003.
I guess if you had applied the latest version of "Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch" (which is v8 posted at [1]) on top of master + 0001 + 0002 + 0003, you'd get better performance too.
Thank you. I will try out these cases.
Thanks,
--
Sho Kato
-----Original Message-----
From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp]
Sent: Monday, September 3, 2018 2:12 PM
To: Kato, Sho/加藤 翔 <kato-sho@jp.fujitsu.com>; Pg Hackers <pgsql-hackers@postgresql.org>
Subject: Re: speeding up planning with partitions
Thank you Kato-san for testing.
On 2018/08/31 19:48, Kato, Sho wrote:
Hi, Amit
Great!
With the total number of records being 6400, I benchmarked while increasing the number of partitions from 100 to 6400.
Applying all three patches, performance improved by 20% for 100 partitions.
I have the same number of records for each partition; do you do the same?
I didn't load any data into the tables when running the tests, because these patches are meant for reducing planner latency. More specifically, they're meant to fix the current planner behavior whereby latency increases with the number of partitions, with a focus on the common case where only a single partition needs to be scanned by a given query.
I'd try to avoid using a benchmark whose results are affected by anything other than planning latency. It would be a good idea to leave the tables empty and just vary the number of partitions and nothing else.
Also, we're interested in planning latency, so using just SELECT and UPDATE in your benchmark script will be enough, because their planning time is affected by the number of partitions. No need to try to measure the INSERT latency, because its planning latency is not affected by the number of partitions. Moreover, I'd rather suggest you take out the INSERT statement from the benchmark for now, because its execution time does vary unfavorably with increasing number of partitions. Sure, there are other patches to address that, but it's better not to mix the patches to avoid confusion.
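For what it's worth, a small script can generate such an empty-table setup so that only the partition count varies between runs. The sketch below is only illustrative (the table and column names are assumptions, not taken from Kato-san's actual scripts):

```python
def partitioned_table_ddl(table, nparts, width=100):
    """Generate DDL for an empty range-partitioned table with nparts
    partitions; no data is loaded, since only planning latency is
    being measured."""
    stmts = [f"CREATE TABLE {table} (aid int, abalance int)"
             f" PARTITION BY RANGE (aid);"]
    for i in range(nparts):
        lo, hi = i * width, (i + 1) * width
        stmts.append(f"CREATE TABLE {table}_{i} PARTITION OF {table}"
                     f" FOR VALUES FROM ({lo}) TO ({hi});")
    return stmts

# Feed the output to psql, then benchmark only SELECT and UPDATE with
# pgbench in non-prepared mode, varying nparts and nothing else.
for stmt in partitioned_table_ddl("accounts", 4):
    print(stmt)
```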
Also, in my case, performance was better when not prepared.
Patches in this thread do nothing for execution, so there is no need to compare prepared vs non-prepared. In fact, just measure non-prepared tps and latency, because we're only interested in planning time here.
I think these patches do not improve the execution case, so we need the faster runtime pruning patch [1], right?
We already have run-time pruning code (that is the code in the patch you
linked) committed into the tree in PG 11, so the master tree also has it.
But since we're not interested in execution time, no need to worry about run-time pruning.
Details of measurement conditions and results are as follows.
- base source
master(@777e6ddf17) + Speeding up Insert v8 patch[1]
- table definition (e.g. 100 partitions)
create table test.accounts(aid serial, abalance int)
partition by range(aid);
create table test.accounts_history(id serial, aid int, delta int,
mtime timestamp without time zone)
partition by range(aid);
- command option
pgbench -d testdb -f benchmark.pgbench -T 180 -r -n -M prepare
pgbench -d testdb -f benchmark.pgbench -T 180 -r -n
- results
base source no prepared
part_num | tps_ex | update_latency | select_latency | insert_latency
----------+------------+----------------+----------------+----------------
100 | 662.414805 | 0.357 | 0.265 | 0.421
200 | 494.478431 | 0.439 | 0.349 | 0.579
400 | 307.982089 | 0.651 | 0.558 | 0.927
800 | 191.360676 | 0.979 | 0.876 | 1.548
1600 | 75.344947 | 2.253 | 2.003 | 4.301
3200 | 30.643902 | 5.716 | 4.955 | 10.118
6400 | 16.726056 | 12.512 | 8.582 | 18.054
0001 no prepared
part_num | tps_ex | update_latency | select_latency | insert_latency
----------+------------+----------------+----------------+----------------
100 | 429.816329 | 0.811 | 0.75 | 0.365
200 | 275.211531 | 1.333 | 1.248 | 0.501
400 | 160.499833 | 2.384 | 2.252 | 0.754
800 | 79.387776 | 4.935 | 4.698 | 1.468
1600 | 24.787377 | 16.593 | 15.954 | 4.302
3200 | 9.846421 | 42.96 | 42.139 | 8.848
6400 | 4.919772 | 87.43 | 83.109 | 16.56
Hmm, since 0001 is meant to improve update planning time, it seems odd that you'd get poorer results compared to base source. But, it seems base source is actually master + the patch to improve the execution time of update, so maybe that patch is playing a part here, although I'm not sure why even that's making this much of a difference.
I suggest that you use un-patched master as base source, that is, leave out any patches to improve execution time.
[ ... ]
0001 + 0002 no prepared
part_num | tps_ex | update_latency | select_latency | insert_latency
----------+------------+----------------+----------------+----------------
100 | 682.53091 | 0.388 | 0.35 | 0.35
200 | 469.906601 | 0.543 | 0.496 | 0.51
400 | 321.915349 | 0.78 | 0.721 | 0.752
800 | 201.620975 | 1.246 | 1.156 | 1.236
1600 | 94.438204 | 2.612 | 2.335 | 2.745
3200 | 38.292922 | 6.657 | 5.579 | 6.808
6400 | 21.48462 | 11.989 | 10.104 | 12.601
[ ... ]
0001 + 0002 + 0003 no prepared
part_num | tps_ex | update_latency | select_latency | insert_latency
----------+------------+----------------+----------------+----------------
100 | 798.962029 | 0.304 | 0.267 | 0.339
200 | 577.893396 | 0.384 | 0.346 | 0.487
400 | 426.542177 | 0.472 | 0.435 | 0.717
800 | 288.616213 | 0.63 | 0.591 | 1.162
1600 | 154.247034 | 1.056 | 0.987 | 2.384
3200 | 59.711446 | 2.416 | 2.233 | 6.514
6400 | 37.109761 | 3.387 | 3.099 | 11.762
[ ... ]
By the way, as you may have noticed, I posted a version 2 of the patches on this thread. If you apply them, you will be able to see almost the same TPS for any number of partitions with master + 0001 + 0002 + 0003.
Although it may not be related to this, when measured with pg11 beta2, somehow the performance was better.
11beta2 + v1-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch[3] prepared
part_num | tps_ex | update_latency | select_latency | insert_latency
----------+-------------+----------------+----------------+----------------
100 | 756.07228 | 0.942 | 0.091 | 0.123
I guess if you had applied the latest version of "Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch" (which is v8 posted at [1]) on top of master + 0001 + 0002 + 0003, you'd get better performance too. But, as mentioned above, we're interested in measuring planning latency, not execution latency, so we should leave out any patches aimed at improving execution latency to avoid confusion.
Thanks again.
Regards,
Amit
[1]: /messages/by-id/CAKJS1f9T_32Xpb-p8cWwo5ezSfVhXviUW8QTWncP8ksPHDRK8g@mail.gmail.com
Hi Dilip,
Thanks for taking a look.
On 2018/09/03 20:57, Dilip Kumar wrote:
The idea looks interesting while going through the patch I observed
this comment.
/*
* inheritance_planner
* Generate Paths in the case where the result relation is an
* inheritance set.
*
* We have to handle this case differently from cases where a source relation
* is an inheritance set. Source inheritance is expanded at the bottom of the
* plan tree (see allpaths.c), but target inheritance has to be expanded at
* the top.
I think with your patch these comments need to be changed?
Yes, maybe a good idea to mention that partitioned table result relations
are not handled here.
Actually, I've been wondering if this patch (0001) shouldn't get rid of
inheritance_planner altogether and implement the new approach for *all*
inheritance sets, not just partitioned tables, but haven't spent much time
on that idea yet.
if (parse->resultRelation &&
- rt_fetch(parse->resultRelation, parse->rtable)->inh)
+ rt_fetch(parse->resultRelation, parse->rtable)->inh &&
+ rt_fetch(parse->resultRelation, parse->rtable)->relkind !=
+ RELKIND_PARTITIONED_TABLE)
inheritance_planner(root);
else
grouping_planner(root, false, tuple_fraction);
I think we can add some comments to explain why, if the target rel itself
is a partitioned rel, we can directly go to the grouping planner.
Okay, I will try to add more explanatory comments here and there in the
next version I will post later this week.
Thanks,
Amit
On 30 August 2018 at 00:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
With various overheads gone thanks to 0001 and 0002, locking of all
partitions via find_all_inheritors can be seen as the single largest
bottleneck, which 0003 tries to address. I've kept it a separate patch,
because I'll need to think a bit more to say that it's actually safe to
defer locking to late planning, due mainly to the concern about the change
in the order of locking from the current method. I'm attaching it here,
because I also want to show the performance improvement we can expect with it.
For now, find_all_inheritors() locks the tables in ascending Oid
order. This makes sense with inheritance parent tables as it's much
cheaper to sort on this rather than on something like the table's
namespace and name. I see no reason why what we sort on really
matters, as long as we always sort on the same thing and the key we
sort on is always unique so that the locking order is well defined.
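David's invariant here (any unique sort key all lockers agree on yields a deadlock-free locking order) can be illustrated outside PostgreSQL with a small sketch; the OIDs below are made up:

```python
import threading

# One lock per table OID; the OID values are arbitrary.
locks = {oid: threading.Lock() for oid in (1001, 3003, 5005)}
results = []

def lock_in_order(oids):
    # Acquire in a globally consistent (here: sorted-Oid) order.  The
    # particular key doesn't matter, only that it is unique and every
    # locker agrees on it, so two backends can never wait on each
    # other in a cycle.
    order = sorted(oids)
    for oid in order:
        locks[oid].acquire()
    results.append(order)
    for oid in reversed(order):
        locks[oid].release()

# Two "backends" requesting overlapping lock sets in different orders.
t1 = threading.Thread(target=lock_in_order, args=([5005, 1001],))
t2 = threading.Thread(target=lock_in_order, args=([1001, 3003, 5005],))
t1.start(); t2.start()
t1.join(); t2.join()
```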
For partitioned tables, there's really not much sense in sticking to
the same lock by Oid order. The order of the PartitionDesc is already
well defined so I don't see any reason why we can't just perform the
locking in PartitionDesc order. This would mean you could perform the
locking of the partitions once pruning is complete somewhere around
add_rel_partitions_to_query(). Also, doing this would remove the need
for scanning pg_inherits during find_all_inheritors() and would likely
further speed up the planning of queries to partitioned tables with
many partitions. I wrote a function named
get_partition_descendants_worker() to do this in patch 0002 in [1] (it
may have been a bad choice to have made this a separate function
rather than just part of find_all_inheritors() as it meant I had to
change a bit too much code in tablecmds.c). There might be something
you can salvage from that patch to help here.
[1]: /messages/by-id/CAKJS1f9QjUwQrio20Pi=yCHmnouf4z3SfN8sqXaAcwREG6k0zQ@mail.gmail.com
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2018/09/04 22:24, David Rowley wrote:
On 30 August 2018 at 00:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
With various overheads gone thanks to 0001 and 0002, locking of all
partitions via find_all_inheritors can be seen as the single largest
bottleneck, which 0003 tries to address. I've kept it a separate patch,
because I'll need to think a bit more to say that it's actually safe to
defer locking to late planning, due mainly to the concern about the change
in the order of locking from the current method. I'm attaching it here,
because I also want to show the performance improvement we can expect with it.
For now, find_all_inheritors() locks the tables in ascending Oid
order. This makes sense with inheritance parent tables as it's much
cheaper to sort on this rather than on something like the table's
namespace and name. I see no reason why what we sort on really
matters, as long as we always sort on the same thing and the key we
sort on is always unique so that the locking order is well defined.
For partitioned tables, there's really not much sense in sticking to
the same lock by Oid order. The order of the PartitionDesc is already
well defined so I don't see any reason why we can't just perform the
locking in PartitionDesc order.
The reason we do the locking with find_all_inheritors for regular queries
(planner (expand_inherited_rtentry) does it for select/update/delete and
executor (ExecSetupPartitionTupleRouting) for insert) is that that's the
order used by various DDL commands when locking partitions, such as when
adding a column. So, we'd have one piece of code locking partitions by
Oid order and others by canonical partition bound order or PartitionDesc
order. I'm no longer sure if that's problematic though, about which more
below.
This would mean you could perform the
locking of the partitions once pruning is complete somewhere around
add_rel_partitions_to_query(). Also, doing this would remove the need
for scanning pg_inherits during find_all_inheritors() and would likely
further speed up the planning of queries to partitioned tables with
many partitions.
That's what happens with 0003; note the following hunk in it:
@@ -1555,14 +1555,15 @@ expand_inherited_rtentry(PlannerInfo *root,
RangeTblEntry *rte, Index rti)
lockmode = AccessShareLock;
/* Scan for all members of inheritance set, acquire needed locks */
- inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);
+ if (rte->relkind != RELKIND_PARTITIONED_TABLE)
+ inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);
I wrote a function named
get_partition_descendants_worker() to do this in patch 0002 in [1] (it
may have been a bad choice to have made this a separate function
rather than just part of find_all_inheritors() as it meant I had to
change a bit too much code in tablecmds.c). There might be something
you can salvage from that patch to help here.[1] /messages/by-id/CAKJS1f9QjUwQrio20Pi=yCHmnouf4z3SfN8sqXaAcwREG6k0zQ@mail.gmail.com
Actually, I had written a similar patch to replace the usages of
find_all_inheritors and find_inheritance_children by different
partitioning-specific functions which would collect the partition OIDs
from the already open parent table's PartitionDesc, more or less like the
patch you mention does. But I think we don't need any new function(s) to
do that, that is, we don't need to change all the sites that call
find_all_inheritors or find_inheritance_children in favor of new functions
that return partition OIDs in PartitionDesc order, if only *because* we
want to change the planner to lock partitions in PartitionDesc order. I'm
failing to see why the difference in locking order matters. I understood
the concern as that locking partitions in different order could lead to a
deadlock if concurrent backends request mutually conflicting locks, but
only one of the conflicting backends, the one that got lock on the parent,
would be allowed to lock children. Thoughts on that?
Thanks,
Amit
On 30 August 2018 at 21:29, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Attached updated patches, with 0002 containing the changes mentioned above.
Here's my first pass review on this:
0001:
1. I think the following paragraph should probably mention something
about some performance difference between the two methods:
<para>
Constraint exclusion works in a very similar way to partition
pruning, except that it uses each table's <literal>CHECK</literal>
constraints — which gives it its name — whereas partition
pruning uses the table's partition bounds, which exist only in the
case of declarative partitioning. Another difference is that
constraint exclusion is only applied at plan time; there is no attempt
to remove partitions at execution time.
</para>
Perhaps tagging on. "Constraint exclusion is also a much less
efficient way of eliminating unneeded partitions as metadata for each
partition must be loaded in the planner before constraint exclusion
can be applied. This is not a requirement for partition pruning."
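To make the cost difference concrete, here is a rough sketch (not PostgreSQL code) of why bound-based pruning beats constraint exclusion: the former is a single binary search over the sorted partition bounds, while the latter must test every partition's CHECK constraint in turn:

```python
import bisect

# Range partitioning: partition i covers [bounds[i], bounds[i+1]).
bounds = list(range(0, 6500, 100))      # 64 partitions of width 100

def prune_by_metadata(key):
    # Partition pruning: one binary search over the bounds, O(log n).
    return bisect.bisect_right(bounds, key) - 1

def prune_by_constraints(key):
    # Constraint exclusion: evaluate each partition's CHECK constraint
    # in turn, O(n) even when only one partition survives.
    for i in range(len(bounds) - 1):
        if bounds[i] <= key < bounds[i + 1]:
            return i
    return None
```

Both return the same single surviving partition, but only the metadata-based search avoids touching per-partition state for the pruned ones.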
2. I think "rootrel" should be named "targetpart" in:
+ RelOptInfo *rootrel = root->simple_rel_array[root->parse->resultRelation];
3. Why does the following need to list_copy()?
+ List *saved_join_info_list = list_copy(root->join_info_list);
4. Is the "root->parse->commandType != CMD_INSERT" required in:
@@ -181,13 +185,30 @@ make_one_rel(PlannerInfo *root, List *joinlist)
/*
* Generate access paths for the entire join tree.
+ *
+ * If we're doing this for an UPDATE or DELETE query whose target is a
+ * partitioned table, we must do the join planning against each of its
+ * leaf partitions instead.
*/
- rel = make_rel_from_joinlist(root, joinlist);
+ if (root->parse->resultRelation &&
+ root->parse->commandType != CMD_INSERT &&
+ root->simple_rel_array[root->parse->resultRelation] &&
+ root->simple_rel_array[root->parse->resultRelation]->part_scheme)
+ {
Won't the simple_rel_array entry for the resultRelation always be NULL
for an INSERT?
5. In regards to:
+ /*
+ * Hack to make the join planning code believe that 'partrel' can
+ * be joined against.
+ */
+ partrel->reloptkind = RELOPT_BASEREL;
Have you thought about other implications of join planning for other
member rels, for example, equivalence classes and em_is_child?
6. It would be good to not have to rt_fetch the same RangeTblEntry twice here:
@@ -959,7 +969,9 @@ subquery_planner(PlannerGlobal *glob, Query *parse,
* needs special processing, else go straight to grouping_planner.
*/
if (parse->resultRelation &&
- rt_fetch(parse->resultRelation, parse->rtable)->inh)
+ rt_fetch(parse->resultRelation, parse->rtable)->inh &&
+ rt_fetch(parse->resultRelation, parse->rtable)->relkind !=
+ RELKIND_PARTITIONED_TABLE)
inheritance_planner(root);
7. Why don't you just pass the Parse into the function as a parameter
instead of overwriting PlannerInfo's copy in:
+ root->parse = partition_parse;
+ partitionwise_adjust_scanjoin_target(root, child_rel,
+ subroots,
+ partitioned_rels,
+ resultRelations,
+ subpaths,
+ WCOLists,
+ returningLists,
+ rowMarks);
+ /* Restore the Query for processing the next partition. */
+ root->parse = parse;
8. Can you update the following comment to mention why you're not
calling add_paths_to_append_rel for this case?
@@ -6964,7 +7164,9 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
}
/* Build new paths for this relation by appending child paths. */
- if (live_children != NIL)
+ if (live_children != NIL &&
+ !(rel->reloptkind == RELOPT_BASEREL &&
+ rel->relid == root->parse->resultRelation))
add_paths_to_append_rel(root, rel, live_children);
9. The following should use >= 0, not > 0
+ while ((relid = bms_next_member(child_rel->relids, relid)) > 0)
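The distinction matters because partition index 0 is a valid member; with a strict `> 0` test the loop exits before processing it. A Python analog of the iteration idiom (bms_next_member() returns -2 when the set is exhausted) shows the difference; this is a sketch, not the real bitmapset code:

```python
def next_member(members, prev):
    # Analog of bms_next_member(): smallest member greater than prev,
    # or -2 when there are no more members.
    later = [m for m in sorted(members) if m > prev]
    return later[0] if later else -2

def iterate(members, strict=False):
    out, m = [], -1
    # The correct loop condition is ">= 0"; a strict "> 0" silently
    # drops member 0 and terminates the whole loop.
    while (m := next_member(members, m)) > (0 if strict else -1):
        out.append(m)
    return out
```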
0002:
10. I think moving the PlannerInfo->total_table_pages calculation
needs to be done in its own patch. This is a behavioural change where
we no longer count pruned relations in the calculation which can
result in plan changes. I posted a patch in [1] to fix this as a bug
fix as I think the current code is incorrect. My patch also updates
the first paragraph of the comment. You've not done that.
11. "pruning"
+ * And do prunin. Note that this adds AppendRelInfo's of only the
12. It's much more efficient just to do bms_add_range() outside the loop in:
+ for (i = 0; i < rel->nparts; i++)
+ {
+ rel->part_rels[i] = build_partition_rel(root, rel,
+ rel->part_oids[i]);
+ rel->live_parts = bms_add_member(rel->live_parts, i);
+ }
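A rough integer-bitmask analog of the two operations (not PostgreSQL's actual bitmapset implementation) shows why one range-add is preferable to nparts single-bit adds:

```python
def bms_add_member(mask, i):
    # One bit at a time, as the per-partition loop above does.
    return mask | (1 << i)

def bms_add_range(mask, lower, upper):
    # Set bits lower..upper inclusive in a single arithmetic step.
    return mask | (((1 << (upper - lower + 1)) - 1) << lower)

# Both produce the same set, but the range form does constant work
# per call instead of one call per bit.
mask = 0
for i in range(64):
    mask = bms_add_member(mask, i)
assert mask == bms_add_range(0, 0, 63)
```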
13. In set_append_rel_size() the foreach(l, root->append_rel_list)
loop could be made to loop over RelOptInfo->live_parts instead which
would save having to skip over AppendRelInfos that don't belong to
this parent. You'd need to make live_parts more general purpose and
also use it to mark children of inheritance parents.
14. I think you can skip the following if both are NULL. You could
likely add more smarts for different join types, but I think you
should at least skip when both are NULL. Perhaps the loop could be
driven off of bms_intersec of the two Rel's live_parts?
+ if (child_rel1 == NULL)
+ child_rel1 = build_dummy_partition_rel(root, rel1, cnt_parts);
+ if (child_rel2 == NULL)
+ child_rel2 = build_dummy_partition_rel(root, rel2, cnt_parts);
15. The following is not required when append_rel_array was previously NULL.
+ MemSet(root->append_rel_array + root->simple_rel_array_size,
+ 0, sizeof(AppendRelInfo *) * num_added_parts);
16. I don't think scan_all_parts is worth the extra code. The cost of
doing bms_num_members is going to be pretty small in comparison to
building paths for and maybe doing a join search for all partitions.
+ num_added_parts = scan_all_parts ? rel->nparts :
+ bms_num_members(partindexes);
In any case, you've got a bug in prune_append_rel_partitions() where
you're setting scan_all_parts to true instead of returning when
contradictory == true.
17. lockmode be of type LOCKMODE, not int.
+ Oid childOID, int lockmode,
18. Comment contradicts the code.
+ /* Open rel if needed; we already have required locks */
+ if (childOID != parentOID)
+ {
+ childrel = heap_open(childOID, lockmode);
I think you should be using NoLock here.
19. Comment contradicts the code.
+ /* Close child relations, but keep locks */
+ if (childOID != parentOID)
+ {
+ Assert(childrel != NULL);
+ heap_close(childrel, lockmode);
+ }
20. I assume translated_vars can now be NIL due to
build_dummy_partition_rel() not setting it.
- if (var->varlevelsup == 0 && appinfo)
+ if (var->varlevelsup == 0 && appinfo && appinfo->translated_vars)
It might be worth writing a comment to explain that, otherwise it's
not quite clear why you're doing this.
21. Unrelated change;
Assert(relid > 0 && relid < root->simple_rel_array_size);
+
22. The following comment mentions that Oids are copied, but that does
not happen in the code.
+ /*
+ * For partitioned tables, we just allocate space for RelOptInfo's.
+ * pointers for all partitions and copy the partition OIDs from the
+ * relcache. Actual RelOptInfo is built for a partition only if it is
+ * not pruned.
+ */
The Oid copying already happened during get_relation_info().
23. Traditionally translated_vars populated with a sequence of Vars in
the same order to mean no translation. Here you're changing how that
works:
+ /* leaving translated_vars to NIL to mean no translation needed */
This seems to be about the extent of your documentation on this, which
is not good enough.
24. "each" -> "reach"? ... Actually, I don't understand the comment.
In a partitioned hierarchy, how can the one before the top-level
partitioned table not be a partitioned table?
/*
* Keep moving up until we each the parent rel that's not a
* partitioned table. The one before that one would be the root
* parent.
*/
25. "already"
+ * expand_inherited_rtentry alreay locked all partitions, so pass
26. Your changes to make_partitionedrel_pruneinfo() look a bit broken.
You're wrongly assuming live_parts is the same as present_parts. If a
CHECK constraint eliminated the partition then those will be present
in live_parts but won't be part of the Append/MergeAppend subplans.
You might be able to maintain some of this optimisation by checking
for dummy rels inside the loop, but you're going to need to put back
the code that sets present_parts.
+ present_parts = bms_copy(subpart->live_parts);
27. Comment contradicts the code:
+ Bitmapset *live_parts; /* unpruned parts; NULL if all are live */
in add_rel_partitions_to_query() you're doing:
+ if (scan_all_parts)
+ {
+ for (i = 0; i < rel->nparts; i++)
+ {
+ rel->part_rels[i] = build_partition_rel(root, rel,
+ rel->part_oids[i]);
+ rel->live_parts = bms_add_member(rel->live_parts, i);
+ }
+ }
so the NULL part seems untrue.
28. Missing comments:
+ TupleDesc tupdesc;
+ Oid reltype;
29. The comment for prune_append_rel_partitions claims it "Returns RT
indexes", but that's not the case, per:
-Relids
-prune_append_rel_partitions(RelOptInfo *rel)
+void
+prune_append_rel_partitions(PlannerInfo *root, RelOptInfo *rel)
0003:
30. 2nd test can be tested inside the first test to remove redundant
partition check.
- inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);
+ if (rte->relkind != RELKIND_PARTITIONED_TABLE)
+ inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);
/*
* Check that there's at least one descendant, else treat as no-child
* case. This could happen despite above has_subclass() check, if table
* once had a child but no longer does.
*/
- if (list_length(inhOIDs) < 2)
+ if (rte->relkind != RELKIND_PARTITIONED_TABLE && list_length(inhOIDs) < 2)
{
31. The following code is wrong:
+ /* Determine the correct lockmode to use. */
+ if (rootRTindex == root->parse->resultRelation)
+ lockmode = RowExclusiveLock;
+ else if (rootrc && RowMarkRequiresRowShareLock(rootrc->markType))
+ lockmode = RowShareLock;
+ else
+ lockmode = AccessShareLock;
rootRTindex remains at 0 if there are no row marks and resultRelation
will be 0 for SELECT queries, this means you'll use a RowExclusiveLock
for a SELECT instead of an AccessShareLock.
Why not just check the lockmode of the parent and use that?
[1]: https://commitfest.postgresql.org/19/1768/
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Aug 29, 2018 at 5:06 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
It is more or less well known that the planner doesn't perform well with
more than a few hundred partitions even when only a handful of partitions
are ultimately included in the plan. Situation has improved a bit in PG
11 where we replaced the older method of pruning partitions one-by-one
using constraint exclusion with a much faster method that finds relevant
partitions by using partitioning metadata. However, we could only use it
for SELECT queries, because UPDATE/DELETE are handled by a completely
different code path, whose structure doesn't allow it to call the new
pruning module's functionality. Actually, not being able to use the new
pruning is not the only problem for UPDATE/DELETE, more on which further
below.
This was a big problem for the SQL MERGE patch. I hope that this
problem can be fixed.
--
Peter Geoghegan
On 2018/09/11 10:11, Peter Geoghegan wrote:
On Wed, Aug 29, 2018 at 5:06 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
It is more or less well known that the planner doesn't perform well with
more than a few hundred partitions even when only a handful of partitions
are ultimately included in the plan. Situation has improved a bit in PG
11 where we replaced the older method of pruning partitions one-by-one
using constraint exclusion with a much faster method that finds relevant
partitions by using partitioning metadata. However, we could only use it
for SELECT queries, because UPDATE/DELETE are handled by a completely
different code path, whose structure doesn't allow it to call the new
pruning module's functionality. Actually, not being able to use the new
pruning is not the only problem for UPDATE/DELETE, more on which further
below.
This was a big problem for the SQL MERGE patch. I hope that this
problem can be fixed.
Sorry if I've missed some prior discussion about this, but could you
clarify which aspect of UPDATE/DELETE planning got in the way of SQL MERGE
patch? It'd be great if you can point to an email or portion of the SQL
MERGE discussion thread where this aspect was discussed.
In the updated patch that I'm still hacking on (also need to look at
David's comments earlier today before posting it), I have managed to
eliminate the separation of code paths handling SELECT vs UPDATE/DELETE on
inheritance tables, but I can't be sure if the new approach (also) solves
any problems that were faced during SQL MERGE development.
Thanks,
Amit
On Tue, Sep 4, 2018 at 12:44 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
Hi Dilip,
Thanks for taking a look.
On 2018/09/03 20:57, Dilip Kumar wrote:
The idea looks interesting while going through the patch I observed
this comment.

/*
* inheritance_planner
* Generate Paths in the case where the result relation is an
* inheritance set.
*
* We have to handle this case differently from cases where a source relation
* is an inheritance set. Source inheritance is expanded at the bottom of the
* plan tree (see allpaths.c), but target inheritance has to be expanded at
* the top.

I think with your patch these comments need to be changed?
Yes, maybe a good idea to mention that partitioned table result relations
are not handled here.

Actually, I've been wondering if this patch (0001) shouldn't get rid of
inheritance_planner altogether and implement the new approach for *all*
inheritance sets, not just partitioned tables, but haven't spent much time
on that idea yet.
That will be interesting.
     if (parse->resultRelation &&
-        rt_fetch(parse->resultRelation, parse->rtable)->inh)
+        rt_fetch(parse->resultRelation, parse->rtable)->inh &&
+        rt_fetch(parse->resultRelation, parse->rtable)->relkind !=
+            RELKIND_PARTITIONED_TABLE)
         inheritance_planner(root);
     else
         grouping_planner(root, false, tuple_fraction);

I think we can add some comments to explain, if the target rel itself
is a partitioned rel, why we can directly go to the grouping planner.

Okay, I will try to add more explanatory comments here and there in the
next version I will post later this week.
Okay.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Thanks a lot for your detailed review.
On 2018/09/11 8:23, David Rowley wrote:
On 30 August 2018 at 21:29, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Attached updated patches, with 0002 containing the changes mentioned above.
Here's my first pass review on this:
I've gone through your comments on 0001 so far, but didn't go through
others yet, to which I'll reply separately.
0001:
1. I think the following paragraph should probably mention something
about some performance difference between the two methods:

<para>
Constraint exclusion works in a very similar way to partition
pruning, except that it uses each table's <literal>CHECK</literal>
constraints — which gives it its name — whereas partition
pruning uses the table's partition bounds, which exist only in the
case of declarative partitioning. Another difference is that
constraint exclusion is only applied at plan time; there is no attempt
to remove partitions at execution time.
</para>

Perhaps tagging on. "Constraint exclusion is also a much less
efficient way of eliminating unneeded partitions as metadata for each
partition must be loaded in the planner before constraint exclusion
can be applied. This is not a requirement for partition pruning."
Hmm, isn't that implied by the existing text? It already says that
constraint exclusion uses *each* table's/partition's CHECK constraints,
which should make it clear that for a whole lot of partitions, that will
be slower than partition pruning, which requires accessing only one
table's metadata.
If these patches end up dissociating constraint exclusion completely from
partitioned tables, I'm not sure we need to stress that it is inefficient
for a large number of tables.
2. I think "rootrel" should be named "targetpart" in:
+ RelOptInfo *rootrel = root->simple_rel_array[root->parse->resultRelation];
To me, "targetpart" makes a partitioned table sound like a partition,
which it is not. I get that using "root" can be ambiguous, because a
query can specify a non-root partitioned table.
How about "targetrel"?
3. Why does the following need to list_copy()?
+ List *saved_join_info_list = list_copy(root->join_info_list);
In earlier versions of this code, root->join_info_list would be modified
during make_one_rel_from_joinlist, but that no longer seems true.
Removed the list_copy.
4. Is the "root->parse->commandType != CMD_INSERT" required in:
@@ -181,13 +185,30 @@ make_one_rel(PlannerInfo *root, List *joinlist)
     /*
      * Generate access paths for the entire join tree.
+     *
+     * If we're doing this for an UPDATE or DELETE query whose target is a
+     * partitioned table, we must do the join planning against each of its
+     * leaf partitions instead.
      */
-    rel = make_rel_from_joinlist(root, joinlist);
+    if (root->parse->resultRelation &&
+        root->parse->commandType != CMD_INSERT &&
+        root->simple_rel_array[root->parse->resultRelation] &&
+        root->simple_rel_array[root->parse->resultRelation]->part_scheme)
+    {

Won't the simple_rel_array entry for the resultRelation always be NULL
for an INSERT?
Yep, you're right. Removed.
5. In regards to:
+    /*
+     * Hack to make the join planning code believe that 'partrel' can
+     * be joined against.
+     */
+    partrel->reloptkind = RELOPT_BASEREL;

Have you thought about other implications of join planning for other
member rels, for example, equivalence classes and em_is_child?
Haven't really, but that seems like an important point. I will study it
more closely.
6. It would be good to not have to rt_fetch the same RangeTblEntry twice
here:
@@ -959,7 +969,9 @@ subquery_planner(PlannerGlobal *glob, Query *parse,
      * needs special processing, else go straight to grouping_planner.
      */
     if (parse->resultRelation &&
-        rt_fetch(parse->resultRelation, parse->rtable)->inh)
+        rt_fetch(parse->resultRelation, parse->rtable)->inh &&
+        rt_fetch(parse->resultRelation, parse->rtable)->relkind !=
+            RELKIND_PARTITIONED_TABLE)
         inheritance_planner(root);
The new version doesn't call inheritance_planner at all; there is no
inheritance_planner in the new version.
7. Why don't you just pass the Parse into the function as a parameter
instead of overwriting PlannerInfo's copy in:

+    root->parse = partition_parse;
+    partitionwise_adjust_scanjoin_target(root, child_rel,
+                                         subroots,
+                                         partitioned_rels,
+                                         resultRelations,
+                                         subpaths,
+                                         WCOLists,
+                                         returningLists,
+                                         rowMarks);
+    /* Restore the Query for processing the next partition. */
+    root->parse = parse;
Okay, done.
8. Can you update the following comment to mention why you're not
calling add_paths_to_append_rel for this case?@@ -6964,7 +7164,9 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
}/* Build new paths for this relation by appending child paths. */ - if (live_children != NIL) + if (live_children != NIL && + !(rel->reloptkind == RELOPT_BASEREL && + rel->relid == root->parse->resultRelation)) add_paths_to_append_rel(root, rel, live_children);
Oops, stray code from a previous revision. Removed this hunk.
9. The following should use >= 0, not > 0
+ while ((relid = bms_next_member(child_rel->relids, relid)) > 0)
Yeah, fixed.
Sorry, I haven't yet worked on your comments on 0002 and 0003. For time
being, I'd like to report what I've been up to these past couple of days,
because starting tomorrow until the end of this month, I won't be able to
reply to emails on -hackers due to personal vacation.
So, I spent a better part of last few days on trying to revise the patches
so that it changes the planning code for *all* inheritance cases instead
of just focusing on partitioning. Because, really, the only difference
between the partitioning code and regular inheritance code should be that
partitioning is able to use faster pruning, and all the other parts should
look and work more or less the same. There shouldn't be code here that
deals with partitioning and code there to deal with inheritance.
Minimizing code this way should be a good end to aim for, imho.
Attached is what I have at the moment. As part of the effort mentioned
above, I made 0001 to remove inheritance_planner altogether, instead of
just taking the partitioning case out of it. Then there is the WIP
patch 0004 which tries to move even regular inheritance expansion to late
planning stage, just like the earlier versions did for partitioning. It
will need quite a bit of polishing before we could consider it for merging
with 0002. Of course, I'll need to address your comments before
considering doing that seriously.
Thanks,
Amit
Attachments:
v3-0001-Overhaul-inheritance-update-delete-planning.patch (+631/-731)
v3-0002-Lazy-creation-of-partition-objects-for-planning.patch (+507/-339)
v3-0003-Only-lock-partitions-that-will-be-scanned-by-a-qu.patch (+111/-74)
v3-0004-WIP-generate-all-inheritance-child-RelOptInfos-af.patch (+620/-556)
On Fri, Sep 14, 2018 at 3:58 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
Thanks a lot for your detailed review.
I was going through your patch (v3-0002) and I have some suggestions
1.
- if (nparts > 0)
+ /*
+ * For partitioned tables, we just allocate space for RelOptInfo
+ * pointers for all partitions and copy the partition OIDs from the
+ * relcache. Actual RelOptInfo is built for a partition only if it is
+ * not pruned.
+ */
+ if (rte->relkind == RELKIND_PARTITIONED_TABLE)
+ {
rel->part_rels = (RelOptInfo **)
- palloc(sizeof(RelOptInfo *) * nparts);
+ palloc0(sizeof(RelOptInfo *) * rel->nparts);
+ return rel;
+ }
I think we can delay allocating memory for rel->part_rels, and instead
allocate it in add_rel_partitions_to_query only for those partitions
which survive pruning.
2.
add_rel_partitions_to_query
...
+ /* Expand the PlannerInfo arrays to hold new partition objects. */
+ num_added_parts = scan_all_parts ? rel->nparts :
+ bms_num_members(partindexes);
+ new_size = root->simple_rel_array_size + num_added_parts;
+ root->simple_rte_array = (RangeTblEntry **)
+ repalloc(root->simple_rte_array,
+ sizeof(RangeTblEntry *) * new_size);
+ root->simple_rel_array = (RelOptInfo **)
+ repalloc(root->simple_rel_array,
+ sizeof(RelOptInfo *) * new_size);
+ if (root->append_rel_array)
+ root->append_rel_array = (AppendRelInfo **)
+ repalloc(root->append_rel_array,
+ sizeof(AppendRelInfo *) * new_size);
+ else
+ root->append_rel_array = (AppendRelInfo **)
+ palloc0(sizeof(AppendRelInfo *) *
+ new_size);
Instead of re-pallocing for every partitioned relation, can't we first
count the total number of surviving partitions and repalloc just once?
3.
+ /*
+ * And do prunin. Note that this adds AppendRelInfo's of only the
+ * partitions that are not pruned.
+ */
prunin/pruning
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com