How to make partitioning scale better for larger numbers of partitions

Started by Kato, Sho over 7 years ago · 21 messages
#1Kato, Sho
kato-sho@jp.fujitsu.com

Hi,

I benchmarked a RANGE partitioned table with 1.1k leaf partitions and no sub-partitioned tables.
However, statement latencies on the partitioned table are much slower than on a non-partitioned table.

UPDATE latency is 210 times slower than on a non-partitioned table.
SELECT latency is 36 times slower than on a non-partitioned table.
Surprisingly, INSERT latency is almost the same.

Of course, I'm sure table partitioning works well with up to a hundred partitions, as written in the PostgreSQL documentation.
But my customer will use a partitioned table with 1.1k leaf partitions, so we need to improve performance.

Any ideas?

The results of pgbench and perf are listed below.

pgbench results
---------------

transaction type: update.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 180 s
number of transactions actually processed: 648
latency average = 278.202 ms
tps = 3.594512 (including connections establishing)
tps = 3.594545 (excluding connections establishing)
statement latencies in milliseconds:
0.011 \set aid random(1, 1100 * 1)
0.004 \set delta random(-5000, 5000)
0.038 BEGIN;
277.005 UPDATE test.accounts SET abalance = abalance + :delta WHERE aid = :aid;
1.140 END;

transaction type: select.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 180 s
number of transactions actually processed: 19415
latency average = 9.281 ms
tps = 107.745068 (including connections establishing)
tps = 107.746067 (excluding connections establishing)
statement latencies in milliseconds:
0.800 \set aid random(1, 1100 * 1)
0.137 \set delta random(-5000, 5000)
1.351 BEGIN;
4.941 SELECT abalance FROM test.accounts WHERE aid = :aid;
2.052 END;

transaction type: insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 180 s
number of transactions actually processed: 31895
latency average = 5.654 ms
tps = 176.865541 (including connections establishing)
tps = 176.867086 (excluding connections establishing)
statement latencies in milliseconds:
2.083 \set aid random(1, 1100 * 1)
0.003 \set delta random(-5000, 5000)
0.029 BEGIN;
3.222 INSERT INTO test.accounts_history (aid, delta, mtime) VALUES (:aid, :delta, CURRENT_TIMESTAMP);
0.317 END;

Top 10 of perf report
---------------------

UPDATE:
21.33% postgres postgres [.] range_table_mutator
12.57% postgres postgres [.] AllocSetAlloc
4.97% postgres postgres [.] palloc
4.48% postgres postgres [.] make_one_rel
3.96% postgres postgres [.] lappend
2.74% postgres [kernel.kallsyms] [k] get_page_from_freelist
1.87% postgres postgres [.] setup_append_rel_array
1.68% postgres [kernel.kallsyms] [k] list_del
1.64% postgres [kernel.kallsyms] [k] __alloc_pages_nodemask
1.62% postgres [kernel.kallsyms] [k] unmap_vmas

SELECT:
14.72% postgres postgres [.] AllocSetAlloc
5.14% postgres postgres [.] hash_search_with_hash_value
4.23% postgres postgres [.] palloc
4.06% postgres postgres [.] MemoryContextAllocZeroAligned
2.61% postgres postgres [.] copyObjectImpl
2.34% postgres postgres [.] expression_tree_mutator
2.13% postgres [kernel.kallsyms] [k] _spin_lock
1.91% postgres postgres [.] lappend
1.59% postgres [kernel.kallsyms] [k] __link_path_walk
1.50% postgres postgres [.] set_rel_size

INSERT:
20.75% postgres postgres [.] hash_search_with_hash_value
6.03% postgres postgres [.] hash_any
4.88% postgres postgres [.] AllocSetAlloc
4.05% postgres postgres [.] LWLockRelease
4.05% postgres postgres [.] LWLockAcquire
3.27% postgres postgres [.] oid_cmp
3.06% postgres postgres [.] SearchCatCache
2.97% postgres postgres [.] LockReleaseAll
2.57% postgres postgres [.] pg_qsort
2.37% postgres postgres [.] hash_seq_search

The following is information on the environment used for the benchmark.

Server spec
-----------

The server has 16 CPUs.
Memory size is 264 GB.
The database directory is on an SSD.

database tuning
---------------

shared_buffers = 102GB
max_locks_per_transaction = 1000000

postgresql version
------------------

11beta2 + patch1 + patch2

patch1: Allow direct lookups of AppendRelInfo by child relid
commit 7d872c91a3f9d49b56117557cdbb0c3d4c620687

patch2: 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch
https://commitfest.postgresql.org/18/1690/

table definition
----------------

create table test.accounts(aid serial, abalance int) partition by range(aid);
create table test.accounts_history(id serial, aid int, delta int, mtime timestamp without time zone) partition by range(aid);

create table test.account_part_1 partition of test.accounts for values from (1) to (2);
create table test.account_part_2 partition of test.accounts for values from (2) to (3);
.
.
create table test.account_part_1100 partition of test.accounts for values from (1100) to (1101);

accounts_history is also partitioned in the same way.

There is only one row in each leaf partition for the UPDATE/SELECT benchmarks.
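For reproducibility, the 1100 partitions and the single row per partition can be generated with a DO block instead of 1100 hand-written statements. This is an illustrative sketch based on the definitions above, not the script actually used in the benchmark:

```sql
-- Generate the 1100 leaf partitions of test.accounts (sketch; the original
-- post presumably used a similar generated script).
DO $$
BEGIN
  FOR i IN 1..1100 LOOP
    EXECUTE format(
      'CREATE TABLE test.account_part_%s PARTITION OF test.accounts
         FOR VALUES FROM (%s) TO (%s)', i, i, i + 1);
  END LOOP;
END $$;

-- One row per leaf partition, as used for the UPDATE/SELECT benchmarks.
INSERT INTO test.accounts (aid, abalance)
SELECT g, 0 FROM generate_series(1, 1100) AS g;
```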

regards,

#2Kato, Sho
kato-sho@jp.fujitsu.com
In reply to: Kato, Sho (#1)
RE: How to make partitioning scale better for larger numbers of partitions

I wondered if you compared to PG10 or to inheritance partitioning (parent with relkind='r' and either a trigger or rule, or INSERT/UPDATE directly into the child)?

Thank you for your reply.

I compared against PG11beta2 with a non-partitioned table.

The non-partitioned table has 1100 records in one table.
The partitioned table has one record in each leaf partition.

Regards,

#3Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Kato, Sho (#1)
Re: How to make partitioning scale better for larger numbers of partitions

Kato-san,

On 2018/07/13 11:58, Kato, Sho wrote:

Hi,

I benchmarked on a RANGE partitioned table with 1.1k leaf partitions and no sub-partitioned tables.

Thanks for sharing the results.

But, statement latencies on a partitioned table are much slower than on a non-partitioned table.

UPDATE latency is 210 times slower than on a non-partitioned table.
SELECT latency is 36 times slower than on a non-partitioned table.
Surprisingly, INSERT latency is almost the same.

Yes, INSERT comes out ahead because there is no overhead of partitioning
in the planning phase. As David Rowley reported on the nearby thread
("Speeding up INSERTs and UPDATEs to partitioned tables"), there is still
significant overhead during its execution, so we're still a fair bit
away from the best possible performance.

For SELECT/UPDATE/DELETE, overhead of partitioning in the planning phase
is pretty significant and gets worse as the number of partitions grows. I
had intended to fix that in PG 11, but we could only manage to get part of
that work into PG 11, with significant help from David and others. So,
while PG 11's overhead of partitioning during planning is less than PG
10's due to the improved pruning algorithm, it's still pretty far from
ideal, because it isn't just the pruning algorithm that had overheads. In
fact, PG 11 only removes the pruning overhead for SELECT, so UPDATE/DELETE
still carry the overhead that was in PG 10. The overheads I mention stem
from the fact that for partitioning we still rely on the old planner code
that's used to perform inheritance planning, which requires locking and
opening *all* partitions. We have so far been able to refactor just enough
to use the new code for partition pruning, but there is much refactoring
work left to avoid needlessly locking and opening all partitions.
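The lock-all-partitions behavior described above is easy to observe from SQL. The following sketch is my addition (not from the thread): it counts the relation-level locks a single-row UPDATE takes, using the tables from the benchmark.

```sql
BEGIN;
UPDATE test.accounts SET abalance = abalance + 1 WHERE aid = 42;

-- Count the relation-level locks held by this backend; on PG 11 this
-- reflects every leaf partition being locked, not just the one updated.
SELECT count(*)
FROM pg_locks
WHERE locktype = 'relation' AND pid = pg_backend_pid();
ROLLBACK;
```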

Thanks,
Amit

#4Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Amit Langote (#3)
RE: How to make partitioning scale better for larger numbers of partitions

From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp]

For SELECT/UPDATE/DELETE, overhead of partitioning in the planning phase
is pretty significant and gets worse as the number of partitions grows. I
had intended to fix that in PG 11, but we could only manage to get part of
that work into PG 11, with significant help from David and others. So,
while PG 11's overhead of partitioning during planning is less than PG
10's due to improved pruning algorithm, it's still pretty far from ideal,
because it isn't just the pruning algorithm that had overheads. In fact,
PG 11 only removes the pruning overhead for SELECT, so UPDATE/DELETE still
carry the overhead that was in PG 10.

David has submitted multiple patches for PG 12, one of which speeds up pruning of UPDATE/DELETE (I couldn't find it in the current CF, though.) What challenges are there for future versions, and which of them are being addressed by patches in progress for PG 12, and which issues are untouched?

The overheads I mention stem from
the fact that for partitioning we still rely on the old planner code
that's used to perform inheritance planning, which requires to lock and
open *all* partitions. We have so far been able to refactor just enough
to use the new code for partition pruning, but there is much refactoring
work left to avoid needlessly locking and opening all partitions.

I remember the issue of opening and locking all partitions was discussed before. Does this relate only to the planning phase?

Kato-san, could you try pgbench -M prepared?

Regards
Takayuki Tsunakawa

#5Justin Pryzby
pryzby@telsasoft.com
In reply to: Tsunakawa, Takayuki (#4)
Re: How to make partitioning scale better for larger numbers of partitions

On Fri, Jul 13, 2018 at 05:49:20AM +0000, Tsunakawa, Takayuki wrote:

David has submitted multiple patches for PG 12, one of which speeds up pruning of UPDATE/DELETE (I couldn't find it in the current CF, though.) What challenges are there for future versions, and which of them are being addressed by patches in progress for PG 12, and which issues are untouched?

Here's a known issue which I'm not sure is on anybody's list:
/messages/by-id/4F989CD8.8020804@strategicdata.com.au
=> planning of UPDATE/DELETE (but not SELECT) takes a huge amount of RAM when
run on a parent with many children.

We don't typically have UPDATEs, and those we do have are against the child
tables; but we ran into this last month with a manual query on the parent
causing an OOM. I worked around it, but keep meaning to revisit it to see if
this has changed at all in v11 (very brief testing suggests it hasn't).

Justin

#6David Rowley
david.rowley@2ndquadrant.com
In reply to: Tsunakawa, Takayuki (#4)
Re: How to make partitioning scale better for larger numbers of partitions

On 13 July 2018 at 17:49, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

David has submitted multiple patches for PG 12, one of which speeds up pruning of UPDATE/DELETE (I couldn't find it in the current CF, though.) What challenges are there for future versions, and which of them are being addressed by patches in progress for PG 12, and which issues are untouched?

I've not submitted that for PG12 yet. I had other ideas about just
getting rid of the inheritance planner altogether, but so far don't
have a patch for that. I'm still uncertain if there are any huge blockers
to that either. Join search needs to be performed multiple times, but a
good advantage will be that we can take advantage of partition pruning to
join search only the non-pruned partitions.

The reason I changed my mind about the patch you're talking about is
that it meant that RangeTblEntries needed to keep the same relation id
in the inheritance planner as they did in the grouping planner, and
another patch I have semi-baked delays building both RelOptInfo and
RangeTblEntry records for partitions until after pruning. Without the
RangeTblEntry it was difficult to ensure the relids were in lock-step
between the two planners. There are ways to do it; it just didn't
look pretty.

Hoping to have something later in the cycle.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#7David Rowley
david.rowley@2ndquadrant.com
In reply to: Kato, Sho (#1)
Re: How to make partitioning scale better for larger numbers of partitions

On 13 July 2018 at 14:58, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:

Of course, I'm sure table partitioning works well with up to a hundred
partitions, as written in the PostgreSQL documentation.

But my customer will use a partitioned table with 1.1k leaf partitions,
so we need to improve performance.

Any ideas?

It would be really good if you could review and test my partitioning
patches in the current commitfest. These are the ones authored by me
with the word "partition" in the title.

These 4 patches don't solve all the problems, but they do make some
good gains in some areas. I've still more work to do, so the earlier I
can be finished with the 4 that are there now, the more I can focus on
the rest.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#8Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Tsunakawa, Takayuki (#4)
Re: How to make partitioning scale better for larger numbers of partitions

On 2018/07/13 14:49, Tsunakawa, Takayuki wrote:

From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp]

For SELECT/UPDATE/DELETE, overhead of partitioning in the planning phase
is pretty significant and gets worse as the number of partitions grows. I
had intended to fix that in PG 11, but we could only manage to get part of
that work into PG 11, with significant help from David and others. So,
while PG 11's overhead of partitioning during planning is less than PG
10's due to improved pruning algorithm, it's still pretty far from ideal,
because it isn't just the pruning algorithm that had overheads. In fact,
PG 11 only removes the pruning overhead for SELECT, so UPDATE/DELETE still
carry the overhead that was in PG 10.

David has submitted multiple patches for PG 12, one of which speeds up pruning of UPDATE/DELETE (I couldn't find it in the current CF, though.)

I don't think he has posted a new patch for improving the speed of
UPDATE/DELETE planning yet, although I remember he had posted a PoC patch
back in February or March.

What challenges are there for future versions, and which of them are being addressed by patches in progress for PG 12, and which issues are untouched?

The immediate one I think is to refactor the planner such that the new
pruning code, that we were able to utilize for SELECT in PG 11, can also
be used for UPDATE/DELETE. Refactoring needed to replace the pruning
algorithm was minimal for SELECT, but not so much for UPDATE/DELETE.

Once we're able to utilize the new pruning algorithm for all the cases, we
can move on to refactoring to avoid locking and opening all partitions.
As long as we're relying on constraint exclusion for partition pruning,
which we still do for UPDATE/DELETE, we cannot do that, because constraint
exclusion has to look at each partition individually.

The fact that UPDATE/DELETE planning for partitioning uses huge amounts of
memory and CPU is a pretty big issue, and refactoring the planner to avoid
that may be the hardest of all the problems to be solved here.

The overheads I mention stem from
the fact that for partitioning we still rely on the old planner code
that's used to perform inheritance planning, which requires to lock and
open *all* partitions. We have so far been able to refactor just enough
to use the new code for partition pruning, but there is much refactoring
work left to avoid needlessly locking and opening all partitions.

I remember the issue of opening and locking all partitions was discussed before.

Quite a few times I suppose.

Does this relate only to the planning phase?

If the benchmark contains queries that will need to access just one
partition, then yes, the planning part is the biggest overhead.

Execution-time overhead is limited to having an extra, possibly needless,
Append node, but I know David has a patch for that too.

Thanks,
Amit

#9Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: David Rowley (#6)
RE: How to make partitioning scale better for larger numbers of partitions

From: David Rowley [mailto:david.rowley@2ndquadrant.com]

David has submitted multiple patches for PG 12, one of which speeds up
pruning of UPDATE/DELETE (I couldn't find it in the current CF, though.)
What challenges are there for future versions, and which of them are being
addressed by patches in progress for PG 12, and which issues are untouched?

I've not submitted that for PG12 yet. I had other ideas about just
getting rid of the inheritance planner altogether, but so far don't
have a patch for that. Still uncertain if there are any huge blockers
to that either.

Sorry, I seem to have misunderstood something.

By the way, what do you think is the "ideal and should-be-feasible" goal and the "realistic" goal we can reach in the near future (e.g. PG 12)? Say,

* Planning and execution time is O(log n), where n is the number of partitions
* Planning time is O(log n), execution time is O(1)
* Planning and execution time is O(1), where n is the number of partitions

Regards
Takayuki Tsunakawa

#10David Rowley
david.rowley@2ndquadrant.com
In reply to: Tsunakawa, Takayuki (#9)
Re: How to make partitioning scale better for larger numbers of partitions

On 13 July 2018 at 18:53, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

By the way, what do you think is the "ideal and should-be-feasible" goal and the "realistic" goal we can reach in the near future (e.g. PG 12)? Say,

It depends. Patches don't move that fast without review, and nothing gets
committed without a committer.

* Planning and execution time is O(log n), where n is the number of partitions
* Planning time is O(log n), execution time is O(1)
* Planning and execution time is O(1), where n is the number of partitions

It's going to depend on how many partitions are pruned. We still need
to generate paths for all non-pruned partitions, which is going to be
slow when there are many partitions.

I think we can get pretty close to the non-partitioned planning
performance with SELECT/UPDATE/DELETE when all but one partition
survives pruning. There is always going to be some additional
overhead we can't get rid of, but hopefully it will be small.

Please feel free to review what I have in the July 'fest.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#11Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Amit Langote (#8)
RE: How to make partitioning scale better for larger numbers of partitions

From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp]

The immediate one I think is to refactor the planner such that the new
pruning code, that we were able to utilize for SELECT in PG 11, can also
be used for UPDATE/DELETE. Refactoring needed to replace the pruning
algorithm was minimal for SELECT, but not so much for UPDATE/DELETE.

Once we're able to utilize the new pruning algorithm for all the cases, we
can move on to refactoring to avoid locking and opening of all partitions.
As long as we're relying on constraint exclusion for partition pruning,
which we still do for UPDATE/DELETE, we cannot do that because constraint
exclusion has to look at each partition individually.

The UPDATE/DELETE planning for partitioning using huge memory and CPU is a
pretty big issue and refactoring the planner to avoid that may be what's
hardest of all the problems to be solved here.

Thank you. There seem to be many challenges to address... As a user and PG developer, I'd be happy to see some wiki page that describes the current performance characteristics in terms of # of partitions, the ideal and reachable performance, and the challenges to overcome to reach that ideal goal.

If the benchmark contains queries that will need to access just one
partition, then yes, the planning part is the biggest overhead.

Execution-time overhead is limited to having an extra, possibly needless,
Append node, but I know David has a patch for that too.

That's good news, thanks. Our user will perform around a hundred single-row INSERT/SELECT/UPDATE/DELETE statements in each transaction, and those are PREPAREd. I hope PG 11 (with David's patches) will work well for that workload. I'll wait for Kato-san's pgbench -M prepared result.

Regards
Takayuki Tsunakawa

#12Kato, Sho
kato-sho@jp.fujitsu.com
In reply to: Amit Langote (#3)
RE: How to make partitioning scale better for larger numbers of partitions

Hi Amit,

Thank you for letting me know challenges of partitioning.

So, while PG 11's overhead of partitioning during planning is less than PG 10's due to improved pruning algorithm, it's still pretty far from ideal, because it isn't just the pruning algorithm that had overheads.

That makes sense.
I also benchmarked PG10.
Actually, SELECT latency on PG11beta2 + patch1 is faster than on PG10.

SELECT latency with 800 leaf partitions
---------------------------------------
PG10 5.62 ms
PG11 3.869 ms

But even on PG11, the SELECT statement takes 21.102 ms in the benchmark with 1600 leaf partitions.
That is still a long time, even though PG11's partition pruning algorithm is a binary search.
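As an illustrative aside (my addition, not part of the original message), the plan-time pruning being discussed can be confirmed with EXPLAIN: with a literal partition key, only one leaf partition should survive pruning.

```sql
-- With a literal key, PG 11's plan-time pruning should leave a single
-- surviving child scan (e.g. account_part_42) under the Append node,
-- rather than scans of all 1100 partitions.
EXPLAIN (COSTS OFF)
SELECT abalance FROM test.accounts WHERE aid = 42;
```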

The overheads I mention stem from the fact that for partitioning we still rely on the old planner code that's used to perform inheritance planning, which requires locking and opening *all* partitions.

I debugged UPDATE statement execution on the partitioned table.
range_table_mutator() seems to process all leaf partitions.

#13Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Kato, Sho (#12)
Re: How to make partitioning scale better for larger numbers of partitions

On 2018/07/13 16:29, Kato, Sho wrote:

I also benchmarked PG10.
Actually, SELECT latency on PG11beta2 + patch1 is faster than on PG10.

SELECT latency with 800 leaf partitions
---------------------------------------
PG10 5.62 ms
PG11 3.869 ms

But even on PG11, the SELECT statement takes 21.102 ms in the benchmark with 1600 leaf partitions.
That is still a long time, even though PG11's partition pruning algorithm is a binary search.

Yeah, the pruning algorithm change evidently removed only a small percentage
of the overhead.

The overheads I mention stem from the fact that for partitioning we still rely on the old planner code that's used to perform inheritance planning, which requires locking and opening *all* partitions.

I debugged UPDATE statement execution on the partitioned table.
range_table_mutator() seems to process all leaf partitions.

That's one of the symptoms of it, yes.

With the existing code for UPDATE/DELETE planning of partitioned tables,
the whole Query structure is modified to translate the parent's attribute
numbers to partition attribute numbers and planned freshly *for every
partition*. range_table_mutator() is invoked sometime during that
translation process and itself loops over all partitions, so I'd think it
is susceptible to being prominently visible in perf profiles.

Thanks,
Amit

#14Kato, Sho
kato-sho@jp.fujitsu.com
In reply to: Tsunakawa, Takayuki (#4)
RE: How to make partitioning scale better for larger numbers of partitions

Tsunakawa-san

Kato-san, could you try pgbench -M prepared?

I did pgbench -M prepared and perf record.

UPDATE latency in prepared mode is 95% shorter than in simple mode.
SELECT latency in prepared mode is 54% shorter than in simple mode.
INSERT latency in prepared mode is 8% shorter than in simple mode.

In the perf report, AllocSetAlloc, hash_search_with_hash_value and hash_any appeared in the top entries for all three statements.

Details are as follows.

pgbench results
---------------

transaction type: update.sql
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
duration: 180 s
number of transactions actually processed: 12295
latency average = 14.641 ms
tps = 68.302806 (including connections establishing)
tps = 68.303430 (excluding connections establishing)
statement latencies in milliseconds:
0.009 \set aid random(1, 1100 * 1)
0.004 \set delta random(-5000, 5000)
0.026 BEGIN;
14.089 UPDATE test.accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.513 END;

transaction type: select.sql
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
duration: 180 s
number of transactions actually processed: 45145
latency average = 3.996 ms
tps = 250.272094 (including connections establishing)
tps = 250.274404 (excluding connections establishing)
statement latencies in milliseconds:
0.750 \set aid random(1, 1100 * 1)
0.714 \set delta random(-5000, 5000)
0.023 BEGIN;
2.262 SELECT abalance FROM test.accounts WHERE aid = :aid;
0.247 END;

transaction type: insert.sql
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
duration: 180 s
number of transactions actually processed: 34894
latency average = 5.158 ms
tps = 193.855216 (including connections establishing)
tps = 193.857020 (excluding connections establishing)
statement latencies in milliseconds:
1.007 \set aid random(1, 1100 * 1)
0.802 \set delta random(-5000, 5000)
0.025 BEGIN;
2.963 INSERT INTO test.accounts_history (aid, delta, mtime) VALUES (:aid, :delta, CURRENT_TIMESTAMP);
0.360 END;

Top 10 of perf report
---------------------
UPDATE:
11.86% postgres postgres [.] list_nth
10.23% postgres postgres [.] ExecOpenScanRelation
7.47% postgres postgres [.] list_member_int
4.78% postgres postgres [.] AllocSetAlloc
3.61% postgres postgres [.] palloc0
3.09% postgres postgres [.] hash_search_with_hash_value
2.33% postgres postgres [.] ResourceArrayAdd
1.99% postgres postgres [.] hash_any
1.83% postgres postgres [.] copyObjectImpl
1.64% postgres postgres [.] SearchCatCache1

SELECT:
33.60% postgres postgres [.] hash_search_with_hash_value
13.02% postgres postgres [.] hash_any
5.30% postgres postgres [.] LockAcquireExtended
5.16% postgres postgres [.] LockReleaseAll
4.00% postgres postgres [.] hash_seq_search
3.84% postgres postgres [.] LWLockRelease
3.81% postgres postgres [.] AllocSetAlloc
3.23% postgres postgres [.] LWLockAcquire
2.55% postgres postgres [.] SetupLockInTable
1.90% postgres postgres [.] AcquireExecutorLocks

INSERT:
21.86% postgres postgres [.] hash_search_with_hash_value
6.36% postgres postgres [.] hash_any
4.95% postgres postgres [.] AllocSetAlloc
4.08% postgres postgres [.] LWLockRelease
3.83% postgres postgres [.] LWLockAcquire
3.26% postgres postgres [.] SearchCatCache
3.15% postgres postgres [.] oid_cmp
2.78% postgres postgres [.] LockReleaseAll
2.76% postgres postgres [.] pg_qsort
2.41% postgres postgres [.] SearchCatCache1

#15Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Kato, Sho (#2)
Re: How to make partitioning scale better for larger numbers of partitions

On Fri, Jul 13, 2018 at 9:23 AM, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:

I wondered if you compared to PG10 or to inheritance partitioning (parent with relkind='r' and either a trigger or rule, or INSERT/UPDATE directly into the child)?

Thank you for your reply.

I compared against PG11beta2 with a non-partitioned table.

The non-partitioned table has 1100 records in one table.
The partitioned table has one record in each leaf partition.

I don't think partitioning should be employed this way, even for the
sake of comparison. Depending upon the size of each tuple, when 1100
tuples are inserted into a single table, they will probably occupy a few
hundred pages. In a partitioned table with one tuple per partition,
they will occupy at least 1100 pages. There are also space and locking
overheads to maintaining 1100 tables. I think the right way to compare is
to have data large enough that it really requires 1100 partitions, and
then compare performance by putting that data in 1100 partitions and
in an unpartitioned table. Even with that kind of data you will see
some difference in performance, but it won't be as dramatic as you
report.

I might be missing something though.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#16Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Kato, Sho (#14)
RE: How to make partitioning scale better for larger numbers of partitions

From: Kato, Sho [mailto:kato-sho@jp.fujitsu.com]

I did pgbench -M prepared and perf record.

UPDATE latency in prepared mode is 95% shorter than in simple mode.
SELECT latency in prepared mode is 54% shorter than in simple mode.
INSERT latency in prepared mode is 8% shorter than in simple mode.
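For reference, pgbench's -M prepared roughly corresponds to using an explicit prepared statement, sketched below against the benchmark's test.accounts table. Note that for partitioned tables the planner may still build a custom plan for each execution, so planning cost is reduced rather than eliminated:

```sql
-- Parsed and analyzed once:
PREPARE upd(int, int) AS
  UPDATE test.accounts SET abalance = abalance + $2 WHERE aid = $1;

-- Executed many times with only parameters changing:
EXECUTE upd(42, 100);

DEALLOCATE upd;
```

This explains why prepared mode helps UPDATE the most in the numbers above: the larger the per-statement parse/plan overhead, the more there is to amortize.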

Thanks. And what does the comparison look like between the unpartitioned case and various partition counts? What are the performance characteristics in terms of latency versus partition count? I thought that's what you wanted to reveal first?

Regards
Takayuki Tsunakawa

#17Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Ashutosh Bapat (#15)
Re: How to make partitioning scale better for larger numbers of partitions

On 2018/07/13 22:10, Ashutosh Bapat wrote:

On Fri, Jul 13, 2018 at 9:23 AM, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:

I wondered if you compared to PG10 or to inheritance-partitioning (parent with relkind='r' and either a trigger or rule, or INSERT/UPDATE directly into the child)?

Thank you for your reply.

I compared to PG11beta2 with non-partitioned table.

The non-partitioned table has 1100 records in one table.
The partitioned table has one record in each leaf partition.

I don't think partitioning should be employed this way, even for the
sake of comparison. Depending on the tuple size, 1100 tuples inserted
into a single table will probably occupy a few hundred pages, while in
a partitioned table with one tuple per partition they will occupy at
least 1100 pages. There are also other space and locking overheads in
maintaining 1100 tables. I think the right way to compare is to use a
data set large enough to really require 1100 partitions, and then
compare performance with that data in 1100 partitions versus in an
unpartitioned table. Even with that kind of data you will see some
difference in performance, but it won't be as dramatic as you report.

I might be missing something though.

Perhaps Kato-san only intended to report that the time the planner
spends for a partitioned table with 1100 partitions is just too high
compared to the time it spends on a non-partitioned table. It is, and has
been, clear that this is because planning time explodes as the number of
partitions increases.

If there's lots of data in it, then the result will look completely
different, as you say, because scanning a single partition (of the 1100
total) will take less time than scanning a non-partitioned table
containing 1100 partitions' worth of data. But planning time would
still be higher for the partitioned table, which seems to be the point of
this benchmark.

Thanks,
Amit

#18Kato, Sho
kato-sho@jp.fujitsu.com
In reply to: Amit Langote (#17)
RE: How to make partitioning scale better for larger numbers of partitions

On 2018/07/17 10:49, Amit Langote wrote:

Perhaps Kato-san only intended to report that the time the planner spends for a partitioned table with 1100 partitions is just too high compared to the time it spends on a non-partitioned table.

Yes, that is exactly what this comparison is intended to show.

The purpose of this comparison is to find where the partitioning bottlenecks are.
Using those bottlenecks as hints for improvement, I'd like to bring the performance of partitioned tables as close as possible to that of unpartitioned tables.


#19Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Kato, Sho (#18)
Re: How to make partitioning scale better for larger numbers of partitions

On Tue, Jul 17, 2018 at 8:31 AM, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:

On 2018/07/17 10:49, Amit Langote wrote:

Perhaps Kato-san only intended to report that the time the planner spends for a partitioned table with 1100 partitions is just too high compared to the time it spends on a non-partitioned table.

Yes, that is exactly what this comparison is intended to show.

The purpose of this comparison is to find where the partitioning bottlenecks are.
Using those bottlenecks as hints for improvement, I'd like to bring the performance of partitioned tables as close as possible to that of unpartitioned tables.

That's a good thing, but it may not turn out to be realistic. We should
compare performance where partitioning matters and try to improve in
those contexts; otherwise we might improve performance in scenarios
that are never used.

In this case, even if we improve the planning time by 100%, it hardly
matters, since planning time is negligible compared to the execution
time on the huge data sets where partitioning is actually useful.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#20Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Ashutosh Bapat (#19)
Re: How to make partitioning scale better for larger numbers of partitions

On 2018/07/17 12:14, Ashutosh Bapat wrote:

On Tue, Jul 17, 2018 at 8:31 AM, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:

On 2018/07/17 10:49, Amit Langote wrote:

Perhaps Kato-san only intended to report that the time the planner spends for a partitioned table with 1100 partitions is just too high compared to the time it spends on a non-partitioned table.

Yes, that is exactly what this comparison is intended to show.

The purpose of this comparison is to find where the partitioning bottlenecks are.
Using those bottlenecks as hints for improvement, I'd like to bring the performance of partitioned tables as close as possible to that of unpartitioned tables.

That's a good thing, but it may not turn out to be realistic. We should
compare performance where partitioning matters and try to improve in
those contexts; otherwise we might improve performance in scenarios
that are never used.

In this case, even if we improve the planning time by 100%, it hardly
matters, since planning time is negligible compared to the execution
time on the huge data sets where partitioning is actually useful.

While I agree that it's a good idea to tell users to use partitioning only
if its overhead, especially the planner overhead, is bearable in the
first place, this benchmark shows that even for what I assume are
fairly commonly occurring queries (select .. from / update .. / delete
from partitioned_table where partkey = ?), the planner spends way too
many redundant cycles. Some amount of that overhead will always remain,
and planning with partitioning will always be a bit slower than without
partitioning, but right now it's *too* slow for a non-trivial number of
partitions.

Thanks,
Amit

#21Kato, Sho
kato-sho@jp.fujitsu.com
In reply to: Tsunakawa, Takayuki (#16)
RE: How to make partitioning scale better for larger numbers of partitions

On 2018/07/16 13:16, Tsunakawa-san wrote:

Thanks. And what does the comparison look like between the unpartitioned case and various partition counts? What are the performance characteristics in terms of latency versus partition count? I thought that's what you wanted to reveal first?

For the unpartitioned table, the latency of SELECT/UPDATE statements is close to O(n), where n is the number of records, while INSERT latency is close to O(1).

For the partitioned table, the latency of SELECT/INSERT statements is close to O(log n), where n is the number of partitions, up to 400 partitions, and close to O(n) between 400 and 6400 partitions.
UPDATE latency is close to O(n) up to 400 partitions and seems to be O(n^2) between 400 and 6400 partitions.

Details are as follows.

unpartitioned table result (prepared mode)
------------------------------------------

 scale | latency_avg |   tps_ex    | update_latency | select_latency | insert_latency
-------+-------------+-------------+----------------+----------------+----------------
   100 |        0.24 | 4174.395738 |          0.056 |          0.051 |           0.04
   200 |       0.258 | 3873.099014 |          0.065 |          0.059 |           0.04
   400 |        0.29 | 3453.171112 |          0.081 |          0.072 |          0.041
   800 |       0.355 | 2814.936942 |          0.112 |          0.105 |          0.041
  1600 |       0.493 | 2027.702689 |           0.18 |          0.174 |          0.042
  3200 |       0.761 | 1313.532458 |          0.314 |          0.307 |          0.043
  6400 |       1.214 |  824.001431 |           0.54 |          0.531 |          0.043

partitioned table result (prepared mode)
-----------------------------------------

 num_part | latency_avg |   tps_ex    | update_latency | select_latency | insert_latency
----------+-------------+-------------+----------------+----------------+----------------
      100 |       0.937 | 1067.473258 |          0.557 |          0.087 |          0.123
      200 |        1.65 |  606.244552 |          1.115 |          0.121 |          0.188
      400 |       3.295 |  303.491681 |          2.446 |           0.19 |          0.322
      800 |       8.102 |  123.422456 |          6.553 |          0.337 |          0.637
     1600 |      36.996 |   27.030027 |         31.528 |          1.621 |          2.455
     3200 |     147.998 |    6.756922 |        136.136 |           4.21 |           4.94
     6400 |     666.947 |    1.499383 |        640.879 |          7.631 |          9.642
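The quadratic trend in the UPDATE column can be checked directly from the numbers above: each doubling of the partition count should roughly quadruple the latency. A quick sketch in SQL (runnable in any psql session, since it only uses inline VALUES):

```sql
-- Ratio of each UPDATE latency to the previous (half the partitions).
-- For the 800 -> 6400 range the ratios come out around 4.3-4.8,
-- consistent with roughly O(n^2) growth.
SELECT n, latency,
       round((latency / lag(latency) OVER (ORDER BY n))::numeric, 2)
         AS ratio_vs_half
FROM (VALUES (800, 6.553), (1600, 31.528),
             (3200, 136.136), (6400, 640.879)) AS t(n, latency);
```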

regards,