[Patch] Optimize dropping of relation buffers using dlist
Hi,
Currently, we need to scan the WHOLE of shared buffers when VACUUM
truncates off any empty pages at the end of a relation, or when a relation
is TRUNCATEd.
In our customer's case, thousands of tables are truncated periodically,
possibly with only a single table TRUNCATEd per transaction. This can be
problematic later on during recovery, which can take much longer, especially
when a sudden failover happens after those TRUNCATEs and we
have to scan a large shared buffer pool. In the performance test below,
it took almost 12.5 minutes for recovery to complete with 100GB shared
buffers. But we want to keep failover very short (within 10 seconds).
Previously, I made an improvement that sped up the truncation of relation
forks from three scans of shared buffers down to one scan [1]. This time, the
aim of this patch is to further speed up the invalidation of pages, by linking
the cached pages of the target relation in a doubly-linked list and just
traversing it instead of scanning the whole of shared buffers. In
DropRelFileNodeBuffers, we then just look up the target buffers to invalidate
for the relation.
The win from this patch is significant: we were able to
complete failover and recovery in roughly 3 seconds.
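For reference, below is a simplified sketch of the full scan in the current
DropRelFileNodeBuffers that this patch avoids (per-fork looping and buffer
header locking elided); the cost is O(NBuffers) no matter how few pages of
the relation are actually cached:
for (i = 0; i < NBuffers; i++)
{
    BufferDesc *bufHdr = GetBufferDescriptor(i);

    /* visit every buffer header in the pool ... */
    if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
        InvalidateBuffer(bufHdr);   /* ... to find the few that match */
}
With the patch, only the relation's own cached pages are visited instead.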
I performed tests similar to those I did for the speedup of truncates of
relation forks [1][2], but this time using 100GB shared_buffers.
[Machine spec used in testing]
Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
CPU: 16, Number of cores per socket: 8
RHEL6.5, Memory: 256GB++
[Test]
1. (Master) Create tables (ex. 10,000 tables). Insert data into the tables.
2. (Master) DELETE FROM TABLE (ex. all rows of the 10,000 tables)
(Standby) To test with failover, pause the WAL replay on standby server.
(SELECT pg_wal_replay_pause();)
3. (M) psql -c "\timing on" (measures total execution of SQL queries)
4. (M) VACUUM (whole db)
5. (M) Stop primary server. pg_ctl stop -D $PGDATA -w
6. (S) Resume WAL replay and promote standby. [2]
[Results]
A. HEAD (origin/master branch)
A1. Vacuum execution on Primary server
Time: 730932.408 ms (12:10.932) ~12min 11s
A2. Vacuum + Failover (WAL Recovery on Standby)
waiting for server to promote...........................
.................................... stopped waiting
pg_ctl: server did not promote in time
2019/10/25_12:13:09.692─┐
2019/10/25_12:25:43.576─┘
-->Total: 12min34s
B. PATCH
B1. Vacuum execution on Primary/Master
Time: 6.518333s = 6518.333 ms
B2. Vacuum + Failover (WAL Recovery on Standby)
2019/10/25_14:17:21.822
waiting for server to promote...... done
server promoted
2019/10/25_14:17:24.827
2019/10/25_14:17:24.833
-->Total: 3.011s
[Other Notes]
One disadvantage may be that the number of relations is variable,
while we allocate as many relation structures as there are shared buffers.
I tried to reduce the memory used by the hash table lookup operation
by using a fixed-size array (100 entries) as a threshold on the number
of target buffers to invalidate per lookup.
When doing CachedBufLookup() to scan each buffer in the
dlist, I made sure to limit the number of scans (2x at most).
First, we scan the dlist of cached buffers of the relation
and store the target buffers in buf_id_array. Non-target buffers
are removed from the dlist but added to a temporary dlist.
After reaching the end of the main dlist, we append the temporary dlist
to the tail of the main dlist.
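To illustrate, here is a minimal sketch of that two-pass idea (the helper
names cb_dlist_pop_head/cb_dlist_push_tail/cb_dlist_append and the target
test are illustrative, not the patch's exact code):
nbufs = 0;
cb_dlist_init(&temp);                       /* temporary dlist */

/* First scan: walk the main dlist of the relation's cached buffers. */
while (!cb_dlist_is_empty(&main_list) && nbufs < BUF_ID_ARRAY_SIZE)
{
    int     buf_id = cb_dlist_pop_head(&main_list);

    if (is_target_buffer(buf_id))           /* to be invalidated? */
        buf_id_array[nbufs++] = buf_id;     /* collect for the caller */
    else
        cb_dlist_push_tail(&temp, buf_id);  /* keep cached */
}

/* Second pass: re-attach the survivors to the tail of the main dlist. */
cb_dlist_append(&main_list, &temp);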
I also ran a pgbench buffer test, and this patch did not cause
overhead to normal DB access performance.
Another thing I'd like feedback on is the use of new dlist operations
for this cached buffer list. I did not use the existing Postgres dlist
implementation (ilist.h) in this patch because I want to save as much
memory as possible, especially when NBuffers becomes large. Both dlist_node
& dlist_head are 16 bytes, whereas the two int links used in this patch total 8 bytes.
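For comparison, a sketch of the two per-entry layouts (the CachedBufferEnt
shape here is approximate, not the patch's exact code):
/* ilist.h node: two pointers, 16 bytes on 64-bit platforms */
typedef struct dlist_node
{
    struct dlist_node *prev;
    struct dlist_node *next;
} dlist_node;

/* this patch: buffer IDs are ints, so the links can be 4-byte
 * indexes into the entry array instead of pointers, 8 bytes total */
typedef struct CachedBufferEnt
{
    int     prev;   /* buf_id of previous entry, or end-of-list marker */
    int     next;   /* buf_id of next entry, or end-of-list marker */
} CachedBufferEnt;
With one entry per shared buffer, that halves the per-buffer link overhead.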
Hope to hear your feedback and comments.
Thanks in advance,
Kirk Jamison
[1]: /messages/by-id/D09B13F772D2274BB348A310EE3027C64E2067@g01jpexmbkw24
[2]: /messages/by-id/D09B13F772D2274BB348A310EE3027C6502672@g01jpexmbkw24
Attachments:
v1-Optimize-dropping-of-relation-buffers-using-dlist.patch (+527 -41)
Hi,
In cb_dlist_combine(), the code block below can impact performance
especially for cases when the doubly linked list is long (IOW, many cached buffers).
/* Point to the tail of main dlist */
while (curr_main->next != CACHEDBLOCK_END_OF_LIST)
    curr_main = cb_dlist_next(curr_main);
Attached is an improved version of the previous patch, which adds tail
information in order to speed up the above-mentioned operation.
I stored the tail field in the prev pointer of the head entry (maybe not a typical
approach). A more typical one would be adding a tail field (int tail) to CachedBufferEnt,
but I didn't do that because, as I mentioned in the previous email, I want to avoid
using more memory as much as possible.
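A sketch of the difference (cb_dlist_entry is an illustrative accessor, not
the patch's exact code):
/* v1: O(length) walk to find the tail before combining two dlists */
while (curr_main->next != CACHEDBLOCK_END_OF_LIST)
    curr_main = cb_dlist_next(curr_main);

/* v2: the head entry's otherwise-unused prev link stores the tail's
 * buffer ID, so the tail is reachable in O(1) */
curr_main = cb_dlist_entry(head->prev);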
The patch worked as intended and passed the tests.
Any thoughts?
Regards,
Kirk Jamison
Attachments:
v2-Optimize-dropping-of-relation-buffers-using-dlist.patch (+539 -41)
Hi Kirk,
On Tue, Nov 05, 2019 at 09:58:22AM +0000, k.jamison@fujitsu.com wrote:
[...]
The patch worked as intended and passed the tests.
Any thoughts?
A couple of comments based on briefly looking at the patch.
1) I don't think you should / need to expose most of the new stuff in
buf_internals.h. It's only used from buf_internals.c and having all
the various cb_dlist_* functions in .h seems strange.
2) This adds another hashtable maintenance to BufferAlloc etc. but
you've only done tests / benchmark for the case this optimizes. I
think we need to see a benchmark for a workload that allocates and
invalidates a lot of buffers. A pgbench run with a workload that fits into
RAM but not into shared buffers would be interesting.
3) I see this triggered a failure on cputube, in the commit_ts TAP test.
That's a bit strange, someone should investigate I guess.
https://travis-ci.org/postgresql-cfbot/postgresql/builds/607563900
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Nov 5, 2019 at 10:34 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
2) This adds another hashtable maintenance to BufferAlloc etc. but
you've only done tests / benchmark for the case this optimizes. I
think we need to see a benchmark for a workload that allocates and
invalidates a lot of buffers. A pgbench run with a workload that fits into
RAM but not into shared buffers would be interesting.
Yeah, it seems pretty hard to believe that this won't be bad for some
workloads. Not only do you have the overhead of the hash table
operations, but you also have locking overhead around that. A whole
new set of LWLocks where you have to take and release one of them
every time you allocate or invalidate a buffer seems likely to cause a
pretty substantial contention problem.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thurs, November 7, 2019 1:27 AM (GMT+9), Robert Haas wrote:
On Tue, Nov 5, 2019 at 10:34 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
[...]
Yeah, it seems pretty hard to believe that this won't be bad for some workloads.
Not only do you have the overhead of the hash table operations, but you also
have locking overhead around that.
I'm sorry for the late reply. Thank you Tomas and Robert for checking this patch.
Attached is the v3 of the patch.
- I moved the items that need not be exposed from buf_internals.h to cached_buf.c,
since most of those items are only used in that file.
- Fixed the bug of v2. It seems to pass both the regression and TAP tests now.
Thanks for the advice on the benchmark test. Please refer below for the test and results.
[Machine spec]
CPU: 16, Number of cores per socket: 8
RHEL6.5, Memory: 240GB
scale: 3125 (about 46GB DB size)
shared_buffers = 8GB
[workload that fits into RAM but not into shared buffers]
pgbench -i -s 3125 cachetest
pgbench -c 16 -j 8 -T 600 cachetest
[Patched]
scaling factor: 3125
query mode: simple
number of clients: 16
number of threads: 8
duration: 600 s
number of transactions actually processed: 8815123
latency average = 1.089 ms
tps = 14691.436343 (including connections establishing)
tps = 14691.482714 (excluding connections establishing)
[Master/Unpatched]
...
number of transactions actually processed: 8852327
latency average = 1.084 ms
tps = 14753.814648 (including connections establishing)
tps = 14753.861589 (excluding connections establishing)
My patch caused a little overhead of about 0.42-0.46%, which I think is small.
Kindly let me know your opinions/comments about the patch or tests, etc.
Thanks,
Kirk Jamison
Attachments:
v3-Optimize-dropping-of-relation-buffers-using-dlist.patch (+514 -40)
On Tue, Nov 12, 2019 at 10:49:49AM +0000, k.jamison@fujitsu.com wrote:
[...]
My patch caused a little overhead of about 0.42-0.46%, which I think is small.
Kindly let me know your opinions/comments about the patch or tests, etc.
Now try measuring that with a read-only workload, with prepared
statements. I've tried that on a machine with 16 cores, doing
# 16 clients
pgbench -n -S -j 16 -c 16 -M prepared -T 60 test
# 1 client
pgbench -n -S -c 1 -M prepared -T 60 test
and average from 30 runs of each looks like this:
# clients      master      patched          %
---------------------------------------------
        1       29690        27833      93.7%
       16      300935       283383      94.1%
That's quite a significant regression, considering it's optimizing an
operation that is expected to be pretty rare (people are generally not
dropping objects as often as they query them).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Nov 13, 2019 4:20AM (GMT +9), Tomas Vondra wrote:
[...]
Now try measuring that with a read-only workload, with prepared statements.
I've tried that on a machine with 16 cores, and the average from 30 runs of
each looks like this:
# clients      master      patched          %
---------------------------------------------
        1       29690        27833      93.7%
       16      300935       283383      94.1%
That's quite a significant regression, considering it's optimizing an
operation that is expected to be pretty rare.
I updated the patch to reduce the lock contention of the new LWLock,
with tunable definitions in the code; in addition to using rnode as the
hash key, I also added the modulo of the block number.
#define NUM_MAP_PARTITIONS_FOR_REL 128 /* relation-level */
#define NUM_MAP_PARTITIONS_IN_REL 4 /* block-level */
#define NUM_MAP_PARTITIONS \
(NUM_MAP_PARTITIONS_FOR_REL * NUM_MAP_PARTITIONS_IN_REL)
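The partition for an entry would then be derived from both parts of the key,
roughly like this (the function and variable names here are illustrative):
static inline int
CachedBufMapPartition(uint32 rel_hash, BlockNumber blockNum)
{
    int rel_part = rel_hash % NUM_MAP_PARTITIONS_FOR_REL;
    int blk_part = blockNum % NUM_MAP_PARTITIONS_IN_REL;

    /* one relation's entries spread over several partitions,
     * instead of contending on a single partition lock */
    return rel_part * NUM_MAP_PARTITIONS_IN_REL + blk_part;
}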
I again executed a benchmark for the read-only workload;
the regression currently sits at 3.10% (reduced from v3's 6%).
Average of 10 runs, 16 clients
read-only, prepared query mode
[Master]
num of txn processed: 11,950,983.67
latency average = 0.080 ms
tps = 199,182.24
tps = 199,189.54
[V4 Patch]
num of txn processed: 11,580,256.36
latency average = 0.083 ms
tps = 193,003.52
tps = 193,010.76
I checked the wait event statistics (non-impactful events omitted)
and got the results below.
I reset the stats before running the pgbench script,
then showed the stats right after the run.
[Master]
wait_event_type | wait_event | calls | microsec
-----------------+-----------------------+----------+----------
Client | ClientRead | 25116 | 49552452
IO | DataFileRead | 14467109 | 92113056
LWLock | buffer_mapping | 204618 | 1364779
[Patch V4]
wait_event_type | wait_event | calls | microsec
-----------------+-----------------------+----------+----------
Client | ClientRead | 111393 | 68773946
IO | DataFileRead | 14186773 | 90399833
LWLock | buffer_mapping | 463844 | 4025198
LWLock | cached_buf_tranche_id | 83390 | 336080
It seems the buffer_mapping LWLock wait time is about 3x longer with the patch.
However, I'd like to continue working on this patch into the next commitfest,
and further reduce its impact on read-only workloads.
Regards,
Kirk Jamison
Attachments:
v4-Optimize-dropping-of-relation-buffers-using-dlist.patch (+551 -40)
Hi,
I have updated the patch (v5).
I tried to reduce the lock waiting times by using a spinlock
when inserting/deleting buffers in the new hash table, and an
exclusive lock when doing lookups for buffers to be dropped.
In summary, instead of scanning the whole buffer pool in
shared buffers, we just traverse the doubly-linked list of cached
buffers for the target relation and block.
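Roughly, the locking now looks like this (the partition struct and function
names are illustrative, not the patch's exact code):
/* common path: BufferAlloc/invalidation touch one entry,
 * guarded only by a short per-partition spinlock */
SpinLockAcquire(&part->mutex);
cached_buf_insert(part, buf_id);
SpinLockRelease(&part->mutex);

/* rare path: DropRelFileNodeBuffers walks the dlist under the
 * partition's LWLock held exclusively */
LWLockAcquire(part->lock, LW_EXCLUSIVE);
nbufs = CachedBufLookup(part, rnode, buf_id_array, lengthof(buf_id_array));
LWLockRelease(part->lock);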
In order to understand how this patch affects performance,
I also measured the cache hit rates in addition to
benchmarking the db with various shared buffer size settings.
Using the same machine specs, I used the default pgbench script
for the read-only workload with prepared statements,
and executed about 15 runs for varying shared buffer sizes.
pgbench -i -s 3200 test //(about 48GB db size)
pgbench -S -n -M prepared -c 16 -j 16 -T 60 test
[TPS Regression]
shbuf | tps(master) | tps(patch) | %reg
---------+-----------------+-----------------+-------
5GB | 195,737.23 | 191,422.23 | 2.23
10GB | 197,067.93 | 194,011.66 | 1.55
20GB | 200,241.18 | 200,425.29 | -0.09
40GB | 208,772.81 | 209,807.38 | -0.50
50GB | 215,684.33 | 218,955.43 | -1.52
[CACHE HIT RATE]
Shbuf | master | patch
----------+--------------+----------
10GB | 0.141536 | 0.141485
20GB | 0.330088 | 0.329894
30GB | 0.573383 | 0.573377
40GB | 0.819499 | 0.819264
50GB | 0.999237 | 0.999577
For this workload, the regression increases for shared_buffers
sizes below 20GB. However, the cache hit rate for both master
and patch at 20GB shared_buffers is only 33%. Therefore, I think we
can consider this kind of workload with a small shared_buffers
size a “special case”: in terms of db performance
tuning, we want the db to have as high a cache hit rate as
possible (99.9%, or let's say 80%, is acceptable).
And in this workload, the ideal shared_buffers size to reach
that acceptable cache hit rate would be around 40GB more or less.
Looking at this patch's performance results, within the acceptable
cache hit rate there is at least no regression, and the results also
show almost the same tps as master.
Your feedback about the patch and tests are welcome.
Regards,
Kirk Jamison
Attachments:
v5-Optimize-dropping-of-relation-buffers-using-dlist.patch (+639 -42)
Hi,
I have rebased the patch to keep the CFbot happy.
Apparently, in the previous patch there was a possibility of an infinite loop
when allocating buffers, so I fixed that part and also removed some whitespace.
Kindly check the attached V6 patch.
Any thoughts on this?
Regards,
Kirk Jamison
Attachments:
v6-Optimize-dropping-of-relation-buffers-using-dlist.patch (+638 -42)
On Tue, Feb 4, 2020 at 4:57 AM k.jamison@fujitsu.com
<k.jamison@fujitsu.com> wrote:
Kindly check the attached V6 patch.
Any thoughts on this?
Unfortunately, I don't have time for detailed review of this. I am
suspicious that there are substantial performance regressions that you
just haven't found yet. I would not take the position that this is a
completely hopeless approach, or anything like that, but neither would
I conclude that the tests shown so far are anywhere near enough to be
confident that there are no problems.
Also, systems with very large shared_buffers settings are becoming
more common, and probably will continue to become more common, so I
don't think we can dismiss that as an edge case any more. People don't
want to run with an 8GB cache on a 1TB server.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
I know this might already be late, at the end of the CommitFest, but attached
is the latest version of the patch. The previous version only included the
buffer invalidation improvement for VACUUM. The new patch adds the same
routine for TRUNCATE WAL replay.
In summary, this patch aims to improve the buffer invalidation process
of VACUUM and TRUNCATE. Although it may not be a common use
case, our customer uses these commands a lot. Recovery and WAL
replay of these commands can take time depending on the size of the
database buffers. So this patch optimizes that using the newly-added
auxiliary cache and doubly-linked list in shared memory, so that
we don't need to scan the whole of shared buffers anymore.
As for the performance and how it affects the read-only workloads:
using pgbench, I've kept the overhead to a minimum, less than 1%.
I'll post follow-up results.
Although the additional hash table utilizes shared memory, there's
a significant performance gain for both TRUNCATE and VACUUM
from execution to recovery.
Regards,
Kirk Jamison
Attachments:
v7-Optimize-dropping-of-relation-buffers-using-dlist.patch (+748 -75)
On Wednesday, March 25, 2020 3:25 PM, Kirk Jamison wrote:
As for the performance and how it affects the read-only workloads:
using pgbench, I've kept the overhead to a minimum, less than 1%.
I'll post follow-up results.
Here are the follow-up results.
I executed tests similar to those at the top of the thread.
I hope the performance test results shown below suffice.
If not, I'd appreciate any feedback with regard to the tests or the patch itself.
A. VACUUM execution + Failover test
- 100GB shared_buffers
1. 1000 tables (18MB)
1.1. Execution Time
- [MASTER] 77755.218 ms (01:17.755)
- [PATCH] 2147.914 ms (00:02.148)
1.2. Failover Time (Recovery WAL Replay):
- [MASTER] 01:37.084 (1 min 37.084 s)
- [PATCH] 1627 ms (1.627 s)
2. 10000 tables (110MB)
2.1. Execution Time
- [MASTER] 844174.572 ms (14:04.175) ~14 min 4.175 s
- [PATCH] 75678.559 ms (01:15.679) ~1 min 15.679 s
2.2. Failover Time:
- [MASTER] est. 14 min++
(I didn't measure it anymore because recovery takes
about as long as the execution time.)
- [PATCH] 01:25.559 (1 min 25.559 s)
Significant performance results for VACUUM.
B. TPS Regression for READ-ONLY workload
(PREPARED QUERY MODE, NO VACUUM)
# [16 Clients]
- pgbench -n -S -j 16 -c 16 -M prepared -T 60 cachetest
|shbuf  |Master     |Patch      |% reg  |
|-------|-----------|-----------|-------|
|128MB  | 77,416.76 | 77,162.78 |  0.33%|
|1GB    | 81,941.30 | 81,812.05 |  0.16%|
|2GB    | 84,273.69 | 84,356.38 | -0.10%|
|100GB  | 83,807.30 | 83,924.68 | -0.14%|
# [1 Client]
- pgbench -n -S -c 1 -M prepared -T 60 cachetest
|shbuf  |Master     |Patch      |% reg  |
|-------|-----------|-----------|-------|
|128MB  | 12,044.54 | 12,037.13 |  0.06%|
|1GB    | 12,736.57 | 12,774.77 | -0.30%|
|2GB    | 12,948.98 | 13,159.90 | -1.63%|
|100GB  | 12,982.98 | 13,064.04 | -0.62%|
Each test was run 10 times; the average tps and % regression are
shown above. In some cases the patch caused only minimal overhead,
and in the others it had higher tps than master.
If it does not make it into this CF, I hope to receive feedback in the future
on how to proceed. Thanks in advance!
Regards,
Kirk Jamison
Hi,
Since the last posted version of the patch no longer applies, attached is a rebased version.
Performance results and some benefits and challenges were written upthread.
I'd appreciate your feedback/comments.
Regards,
Kirk Jamison
Attachments:
v8-Optimize-dropping-of-relation-buffers-using-dlist.patch (+749 -76)
On 17.06.2020 09:14, k.jamison@fujitsu.com wrote:
Hi,
Since the last posted version of the patch no longer applies, attached is a rebased version.
Performance results and some benefits and challenges were written upthread.
I'd appreciate your feedback/comments.
Regards,
Kirk Jamison
As far as I understand, this patch can provide a significant improvement in
performance only in the case of recovery of truncates of a large number of
tables. You have added a shared hash of relation buffers, and it certainly
adds some extra overhead. According to your latest results this overhead is
quite small, but it will be hard to prove that there will be no noticeable
regression at some workloads.
I wonder if you have considered the case of a local hash (maintained only
during recovery)?
If there is after-crash recovery, then there will be no concurrent
access to shared buffers and this hash will be up-to-date.
In case of a hot-standby replica we can use some simple invalidation (just
one flag or counter which indicates that the buffer cache was updated).
This hash can also be constructed on demand when DropRelFileNodeBuffers
is called the first time (so we have to scan all buffers once, but subsequent
drop operations will be fast).
I have not thought much about it, but it seems to me that as far as this
problem only affects recovery, we do not need a shared hash for it.
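Something like this sketch (all names here are hypothetical):
static HTAB *recovery_rel_buffers = NULL;   /* rnode -> cached buf_ids */

void
DropRelFileNodeBuffersInRecovery(RelFileNode rnode)
{
    Assert(InRecovery);     /* single process touches shared buffers */

    /* First drop: one full scan of shared buffers builds the map;
     * subsequent drops only look up the relation being dropped. */
    if (recovery_rel_buffers == NULL)
        recovery_rel_buffers = build_local_buffer_map();

    invalidate_from_map(recovery_rel_buffers, rnode);

    /* on a hot-standby replica, concurrent backends do change shared
     * buffers, so the map would need the invalidation flag/counter
     * mentioned above */
}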
On Wednesday, July 29, 2020 4:55 PM, Konstantin Knizhnik wrote:
As far as I understand, this patch can provide a significant improvement in
performance only in the case of recovery of truncates of a large number of tables.
You have added a shared hash of relation buffers, and it certainly adds some extra
overhead. According to your latest results this overhead is quite small, but it will
be hard to prove that there will be no noticeable regression at some workloads.
Thank you for taking a look at this.
Yes, one of the aims is to speed up recovery of truncations, but at the same time the
patch also improves autovacuum, vacuum and relation truncate index executions.
I showed pgbench results above for different types of workloads,
but I am not sure whether those are convincing enough...
I wonder if you have considered the case of a local hash (maintained only during
recovery)?
If there is after-crash recovery, then there will be no concurrent access to shared
buffers and this hash will be up-to-date.
In case of a hot-standby replica we can use some simple invalidation (just one flag
or counter which indicates that the buffer cache was updated).
This hash can also be constructed on demand when DropRelFileNodeBuffers is
called the first time (so we have to scan all buffers once, but subsequent drop
operations will be fast).
I have not thought much about it, but it seems to me that as far as this problem
only affects recovery, we do not need a shared hash for it.
The idea of the patch is to store the relation's cached buffers in a dlist maintained
in shared memory, and traverse that dlist when the buffers are to be dropped,
instead of scanning the whole of shared buffers.
But I understand the point that this is expensive and may cause overhead; that is
why I tried to define a macro to limit the number of pages we process per lookup,
for cases where the lookup cost can be problematic (i.e. too many pages in a relation).
#define BUF_ID_ARRAY_SIZE 100

int buf_id_array[BUF_ID_ARRAY_SIZE];
int forknum_indexes[BUF_ID_ARRAY_SIZE];

/* in DropRelFileNodeBuffers: */
do
{
    /* fetch at most BUF_ID_ARRAY_SIZE target buffers per iteration */
    nbufs = CachedBlockLookup(..., forknum_indexes, buf_id_array,
                              lengthof(buf_id_array));
    for (i = 0; i < nbufs; i++)
    {
        ...
    }
} while (nbufs == lengthof(buf_id_array));
Perhaps the patch adds complexity, so we may want to keep it simpler, or commit it piece by piece?
I will look further into your suggestion of maintaining a local hash only during recovery.
Thank you for the suggestion.
Regards,
Kirk Jamison
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: not tested
Documentation: not tested
I have tested this patch on various workloads and hardware (including a Power2 server with 384 virtual cores)
and didn't find a performance regression.
The new status of this patch is: Ready for Committer
On Friday, July 31, 2020 2:37 AM, Konstantin Knizhnik wrote:
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: not tested
Documentation: not tested
I have tested this patch on various workloads and hardware (including a Power2
server with 384 virtual cores) and didn't find a performance regression.
The new status of this patch is: Ready for Committer
Thank you very much, Konstantin, for testing the patch on different workloads.
I wonder if I need to modify some documentation.
I'll leave the final review to the committer(s) as well.
Regards,
Kirk Jamison
Robert Haas <robertmhaas@gmail.com> writes:
Unfortunately, I don't have time for detailed review of this. I am
suspicious that there are substantial performance regressions that you
just haven't found yet. I would not take the position that this is a
completely hopeless approach, or anything like that, but neither would
I conclude that the tests shown so far are anywhere near enough to be
confident that there are no problems.
I took a quick look through the v8 patch, since it's marked RFC, and
my feeling is about the same as Robert's: it is just about impossible
to believe that doubling (or more) the amount of hashtable manipulation
involved in allocating a buffer won't hurt common workloads. The
offered pgbench results don't reassure me; we've so often found that
pgbench fails to expose performance problems, except maybe when it's
used just so.
But aside from that, I noted a number of things I didn't like a bit:
* The amount of new shared memory this needs seems several orders
of magnitude higher than what I'd call acceptable: according to my
measurements it's over 10KB per shared buffer! Most of that is going
into the CachedBufTableLock data structure, which seems fundamentally
misdesigned --- how could we be needing a lock per map partition *per
buffer*? For comparison, the space used by buf_table.c is about 56
bytes per shared buffer; I think this needs to stay at least within
hailing distance of there.
* It is fairly suspicious that the new data structure is manipulated
while holding per-partition locks for the existing buffer hashtable.
At best that seems bad for concurrency, and at worst it could result
in deadlocks, because I doubt we can assume that the new hash table
has partition boundaries identical to the old one.
* More generally, it seems like really poor design that this has been
written completely independently of the existing buffer hash table.
Can't we get any benefit by merging them somehow?
* I do not like much of anything in the code details. "CachedBuf"
is as unhelpful as could be as a data structure identifier --- what
exactly is not "cached" about shared buffers already? "CombinedLock"
is not too helpful either, nor could I find any documentation explaining
why you need to invent new locking technology in the first place.
At best, CombinedLockAcquireSpinLock seems like a brute-force approach
to an undocumented problem.
* The commentary overall is far too sparse to be of any value ---
basically, any reader will have to reverse-engineer your entire design.
That's not how we do things around here. There should be either a README,
or a long file header comment, explaining what's going on, how the data
structure is organized, and what the locking requirements are.
See src/backend/storage/buffer/README for the sort of documentation
that I think this needs.
Even if I were convinced that there's no performance gotchas,
I wouldn't commit this in anything like its current form.
Robert again:
Also, systems with very large shared_buffers settings are becoming
more common, and probably will continue to become more common, so I
don't think we can dismiss that as an edge case any more. People don't
want to run with an 8GB cache on a 1TB server.
I do agree that it'd be great to improve this area. Just not convinced
that this is how.
regards, tom lane
Hi,
On 2020-07-31 13:39:37 -0400, Tom Lane wrote:
[...]
I took a quick look through the v8 patch, since it's marked RFC, and
my feeling is about the same as Robert's: it is just about impossible
to believe that doubling (or more) the amount of hashtable manipulation
involved in allocating a buffer won't hurt common workloads. The
offered pgbench results don't reassure me; we've so often found that
pgbench fails to expose performance problems, except maybe when it's
used just so.
Indeed. The buffer mapping hashtable already is visible as a major
bottleneck in a number of workloads. Even in readonly pgbench if s_b is
large enough (so the hashtable is larger than the cache). Not to speak
of things like a cached sequential scan with a cheap qual and wide rows.
Robert again:
Also, systems with very large shared_buffers settings are becoming
more common, and probably will continue to become more common, so I
don't think we can dismiss that as an edge case any more. People don't
want to run with an 8GB cache on a 1TB server.
I do agree that it'd be great to improve this area. Just not convinced
that this is how.
Wonder if the temporary fix is just to do explicit hashtable probes for
all pages iff the size of the relation is < s_b / 500 or so. That'll
address the case where small tables are frequently dropped - and
dropping large relations is more expensive from the OS and data loading
perspective, so it's not gonna happen as often.
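A rough sketch of that idea (the threshold and the source of nblocks are
assumptions; this would sit at the top of DropRelFileNodeBuffers in bufmgr.c):
BlockNumber blk;

if (nblocks < NBuffers / 500)       /* relation is small relative to s_b */
{
    for (blk = 0; blk < nblocks; blk++)
    {
        BufferTag   tag;
        uint32      hash;
        int         buf_id;

        INIT_BUFFERTAG(tag, rnode.node, forkNum, blk);
        hash = BufTableHashCode(&tag);

        LWLockAcquire(BufMappingPartitionLock(hash), LW_SHARED);
        buf_id = BufTableLookup(&tag, hash);
        LWLockRelease(BufMappingPartitionLock(hash));

        if (buf_id >= 0)
        {
            /* re-check the tag under the buffer header lock before
             * invalidating -- details elided */
            InvalidateBuffer(GetBufferDescriptor(buf_id));
        }
    }
    return;
}
/* otherwise fall back to the existing full scan of shared buffers */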
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
Indeed. The buffer mapping hashtable already is visible as a major
bottleneck in a number of workloads. Even in readonly pgbench if s_b is
large enough (so the hashtable is larger than the cache). Not to speak
of things like a cached sequential scan with a cheap qual and wide rows.
To be fair, the added overhead is in buffer allocation not buffer lookup.
So it shouldn't add cost to fully-cached cases. As Tomas noted upthread,
the potential trouble spot is where the working set is bigger than shared
buffers but still fits in RAM (so there's no actual I/O needed, but we do
still have to shuffle buffers a lot).
Wonder if the temporary fix is just to do explicit hashtable probes for
all pages iff the size of the relation is < s_b / 500 or so. That'll
address the case where small tables are frequently dropped - and
dropping large relations is more expensive from the OS and data loading
perspective, so it's not gonna happen as often.
Oooh, interesting idea. We'd need a reliable idea of how long the
relation is (preferably without adding an lseek call), but maybe
that's do-able.
regards, tom lane