Parallel heap vacuum
Hi all,
The parallel vacuum we have today supports only index vacuuming.
Therefore, while multiple workers can work on different indexes in
parallel, the heap table is always processed by a single process.
I'd like to propose $subject, which enables us to have multiple
workers running on a single heap table. This would help speed up
vacuuming for tables without indexes or tables with
INDEX_CLEANUP = off.
I've attached a PoC patch for this feature. It implements only
parallel heap scans in lazy vacuum. We can extend this feature to
support parallel heap vacuum as well, either in the future or in the
same patch.
# Overall idea (for parallel heap scan in lazy vacuum)
At the beginning of vacuum, we determine how many workers to launch
based on the table size, as other parallel query operations do. The
number of workers is capped by max_parallel_maintenance_workers. Once
we decide to use a parallel heap scan, we prepare DSM to share data
between the parallel workers and the leader. The information includes
at least the vacuum options such as aggressive, the counters collected
during lazy vacuum such as scanned_pages, vacuum cutoffs such as
VacuumCutoffs and GlobalVisState, and the parallel scan description.
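As a rough sketch of the shared state described above (the struct and field names here are hypothetical, not the patch's actual definitions), the DSM-resident per-worker counters and the leader-side aggregation could look like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch of per-worker state kept in DSM for parallel
 * lazy vacuum; the patch's real structs (PHVShared etc.) will differ. */
typedef struct PHVWorkerCountersSketch
{
    bool     aggressive;      /* vacuum option shared with workers */
    uint32_t oldest_xmin;     /* simplified stand-in for VacuumCutoffs */
    uint64_t scanned_pages;   /* counter each worker accumulates */
} PHVWorkerCountersSketch;

/* After workers exit, the leader sums the per-worker counters. */
static uint64_t
sum_scanned_pages(const PHVWorkerCountersSketch *w, int nworkers)
{
    uint64_t total = 0;
    for (int i = 0; i < nworkers; i++)
        total += w[i].scanned_pages;
    return total;
}
```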
Before starting the heap scan in lazy vacuum, we launch parallel
workers, and then each worker (and the leader) processes different
blocks. Each worker does HOT-pruning on pages and collects dead tuple
TIDs. When adding dead tuple TIDs, workers need to hold an exclusive
lock on the TidStore. At the end of the heap scan phase, workers exit
and the leader waits for all workers to exit. After that, the leader
process gathers the counters collected by the parallel workers and
computes the oldest relfrozenxid (and relminmxid). Then, if parallel
index vacuum is also enabled, we launch other parallel workers for
parallel index vacuuming.
For the parallel heap scan in lazy vacuum, I think we can use
the table_block_parallelscan_XXX() family. One tricky thing we need to
deal with is that if the TidStore memory usage reaches the limit, we
stop the parallel scan, do index vacuum and table vacuum, and then
resume the parallel scan from the previous state. In order to do that,
in the patch, we store ParallelBlockTableScanWorker, the per-worker
parallel scan state, in DSM so that different parallel workers can
resume the scan using the same parallel scan state.
In addition, since we could end up launching fewer workers
than requested, it can happen that some ParallelBlockTableScanWorker
data is used once and then never used again, leaving some blocks
unprocessed. To handle this case, in the patch, the leader process
checks at the end of the parallel scan whether there is an uncompleted
parallel scan. If so, the leader does the scan using the workers'
ParallelBlockTableScanWorker data on their behalf.
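The leader-side cleanup pass above can be modeled with the following toy sketch (the struct and function names are invented for illustration; the real ParallelBlockTableScanWorkerData is richer):

```c
#include <assert.h>
#include <stdint.h>

/* Toy per-worker scan state persisted in shared memory: each state
 * owns a block range [next, end). Names are invented for illustration. */
typedef struct WorkerScanStateSketch
{
    uint32_t next;  /* next block this state would scan */
    uint32_t end;   /* one past the last block of its chunk */
} WorkerScanStateSketch;

/* Leader pass at the end of the parallel scan: walk every prepared
 * state and scan whatever blocks remain (e.g. because fewer workers
 * were launched than states were prepared for). Returns the number of
 * blocks the leader processed on the workers' behalf. */
static uint32_t
finish_unfinished_chunks(WorkerScanStateSketch *states, int nstates)
{
    uint32_t scanned = 0;
    for (int i = 0; i < nstates; i++)
    {
        while (states[i].next < states[i].end)
        {
            /* per-block pruning/collection would go here */
            states[i].next++;
            scanned++;
        }
    }
    return scanned;
}
```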
# Discussions
I'm fairly confident in the overall design of this feature, but there
are some implementation points we need to discuss.
In the patch, I extended vacuumparallel.c to support parallel table
scan (and vacuum in the future). This required adding some table
AM callbacks for DSM size estimation, DSM initialization, the
actual table scans, etc. We need to verify these APIs are appropriate.
Specifically, if we want to support both parallel heap scan and
parallel heap vacuum, do we want to add separate callbacks for them?
It could be overkill, since such a 2-pass vacuum strategy is specific
to the heap AM.
As another implementation idea, we might want to implement parallel
heap scan/vacuum in lazyvacuum.c while minimizing changes to
vacuumparallel.c. That way, we would not need to add table AM
callbacks. However, we would end up with duplicated code related to
parallel operation in vacuum, such as vacuum delays.
Also, we might need to add some functions to share GlobalVisState
among parallel workers, since GlobalVisState is a private struct.
Other points I'm somewhat uncomfortable with, or that need discussion,
remain in the code as XXX comments.
# Benchmark results
* Test-1: parallel heap scan on the table without indexes
I created a 20GB table, generated garbage on it, and ran vacuum while
changing the parallel degree:
create unlogged table test (a int) with (autovacuum_enabled = off);
insert into test select generate_series(1, 600000000); --- 20GB table
delete from test where a % 5 = 0;
vacuum (verbose, parallel 0) test;
Here are the results (total time and heap scan time):
PARALLEL 0: 21.99 s (single process)
PARALLEL 1: 11.39 s
PARALLEL 2: 8.36 s
PARALLEL 3: 6.14 s
PARALLEL 4: 5.08 s
* Test-2: parallel heap scan on the table with one index
I used a table similar to test case 1 but created one btree index on it:
create unlogged table test (a int) with (autovacuum_enabled = off);
insert into test select generate_series(1, 600000000); --- 20GB table
create index on test (a);
delete from test where a % 5 = 0;
vacuum (verbose, parallel 0) test;
I've measured the total execution time as well as the time of each
vacuum phase (from left: heap scan time, index vacuum time, and heap
vacuum time):
PARALLEL 0: 45.11 s (21.89, 16.74, 6.48)
PARALLEL 1: 42.13 s (12.75, 22.04, 7.23)
PARALLEL 2: 39.27 s (8.93, 22.78, 7.45)
PARALLEL 3: 36.53 s (6.76, 22.00, 7.65)
PARALLEL 4: 35.84 s (5.85, 22.04, 7.83)
Overall, I can see that the parallel heap scan in lazy vacuum scales
decently: in both test-1 and test-2, the heap scan got ~4x faster with
4 parallel workers. On the other hand, when it comes to the total
vacuum execution time, I could not see much performance improvement in
test-2 (45.11 vs. 35.84). Looking at the results for PARALLEL 0 vs.
PARALLEL 1 in test-2, the heap scan got faster (21.89 vs. 12.75)
whereas index vacuum got slower (16.74 vs. 22.04), and the heap scan
in test-2 was not as fast as in test-1 with 1 parallel worker (12.75
vs. 11.39).
I think the reason is that the shared TidStore is not very scalable,
since we have a single lock on it. In all cases in test-1, we don't
use the shared TidStore, since all dead tuples are removed during heap
pruning, so the scalability was better overall than in test-2. In the
parallel 0 case in test-2, we use the local TidStore, and from
parallel degree 1 onward in test-2, we use the shared TidStore, which
parallel workers concurrently update. Also, I guess that the lookup
performance of the local TidStore is better than the shared TidStore's
because of the differences between a bump context and a DSA area. I
think this difference contributed to the fact that index vacuuming got
slower (16.74 vs. 22.04).
There are two obvious ideas to improve the overall vacuum execution
time: (1) improve the shared TidStore's scalability and (2) support
parallel heap vacuum. For (1), several ideas are proposed by the ART
authors[1]. I've not tried these ideas, but they might be applicable
to our ART implementation. But I prefer to start with (2) since it
would be easier. Feedback is very welcome.
Regards,
[1]: https://db.in.tum.de/~leis/papers/artsync.pdf
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
parallel_heap_vacuum_scan.patch
On Fri, Jun 28, 2024 at 9:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
# Benchmark results
* Test-1: parallel heap scan on the table without indexes
I created 20GB table, made garbage on the table, and run vacuum while
changing parallel degree:
create unlogged table test (a int) with (autovacuum_enabled = off);
insert into test select generate_series(1, 600000000); --- 20GB table
delete from test where a % 5 = 0;
vacuum (verbose, parallel 0) test;
Here are the results (total time and heap scan time):
PARALLEL 0: 21.99 s (single process)
PARALLEL 1: 11.39 s
PARALLEL 2: 8.36 s
PARALLEL 3: 6.14 s
PARALLEL 4: 5.08 s
* Test-2: parallel heap scan on the table with one index
I used a similar table to the test case 1 but created one btree index on it:
create unlogged table test (a int) with (autovacuum_enabled = off);
insert into test select generate_series(1, 600000000); --- 20GB table
create index on test (a);
delete from test where a % 5 = 0;
vacuum (verbose, parallel 0) test;
I've measured the total execution time as well as the time of each
vacuum phase (from left heap scan time, index vacuum time, and heap
vacuum time):
PARALLEL 0: 45.11 s (21.89, 16.74, 6.48)
PARALLEL 1: 42.13 s (12.75, 22.04, 7.23)
PARALLEL 2: 39.27 s (8.93, 22.78, 7.45)
PARALLEL 3: 36.53 s (6.76, 22.00, 7.65)
PARALLEL 4: 35.84 s (5.85, 22.04, 7.83)
Overall, I can see the parallel heap scan in lazy vacuum has a decent
scalability; In both test-1 and test-2, the execution time of heap
scan got ~4x faster with 4 parallel workers. On the other hand, when
it comes to the total vacuum execution time, I could not see much
performance improvement in test-2 (45.11 vs. 35.84). Looking at the
results PARALLEL 0 vs. PARALLEL 1 in test-2, the heap scan got faster
(21.89 vs. 12.75) whereas index vacuum got slower (16.74 vs. 22.04),
and heap scan in case 2 was not as fast as in case 1 with 1 parallel
worker (12.75 vs. 11.39).
I think the reason is the shared TidStore is not very scalable since
we have a single lock on it. In all cases in the test-1, we don't use
the shared TidStore since all dead tuples are removed during heap
pruning. So the scalability was better overall than in test-2. In
parallel 0 case in test-2, we use the local TidStore, and from
parallel degree of 1 in test-2, we use the shared TidStore and
parallel worker concurrently update it. Also, I guess that the lookup
performance of the local TidStore is better than the shared TidStore's
lookup performance because of the differences between a bump context
and an DSA area. I think that this difference contributed the fact
that index vacuuming got slower (16.74 vs. 22.04).
There are two obvious improvement ideas to improve overall vacuum
execution time: (1) improve the shared TidStore scalability and (2)
support parallel heap vacuum. For (1), several ideas are proposed by
the ART authors[1]. I've not tried these ideas but it might be
applicable to our ART implementation. But I prefer to start with (2)
since it would be easier. Feedback is very welcome.
Starting with (2) sounds like a reasonable approach. We should study a
few more things, like (a) the performance results when there are 3-4
indexes, and (b) the reason for the performance improvement seen with
only heap scans. We normally get benefits from parallelism because of
using multiple CPUs, but parallelizing scans (I/O) shouldn't give much
benefit. Is it possible that you are seeing benefits because most of
the data is either in shared_buffers or in memory? We can probably try
vacuuming tables after restarting the nodes to ensure the data is not
in memory.
--
With Regards,
Amit Kapila.
Dear Sawada-san,
The parallel vacuum we have today supports only for index vacuuming.
Therefore, while multiple workers can work on different indexes in
parallel, the heap table is always processed by the single process.
I'd like to propose $subject, which enables us to have multiple
workers running on the single heap table. This would be helpful to
speedup vacuuming for tables without indexes or tables with
INDEX_CLENAUP = off.
Sounds great. IIUC, vacuuming is still one of the main weak points of Postgres.
I've attached a PoC patch for this feature. It implements only
parallel heap scans in lazyvacum. We can extend this feature to
support parallel heap vacuum as well in the future or in the same
patch.
Before diving in deep, I tested your PoC but found an unclear point.
When vacuuming is requested with parallel > 0, with almost the same
workload as yours, only the first page was scanned and cleaned up.
When parallel was set to zero, I got:
```
INFO: vacuuming "postgres.public.test"
INFO: finished vacuuming "postgres.public.test": index scans: 0
pages: 0 removed, 2654868 remain, 2654868 scanned (100.00% of total)
tuples: 120000000 removed, 480000000 remain, 0 are dead but not yet removable
removable cutoff: 752, which was 0 XIDs old when operation ended
new relfrozenxid: 739, which is 1 XIDs ahead of previous value
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 344.639 MB/s, avg write rate: 344.650 MB/s
buffer usage: 2655045 hits, 2655527 misses, 2655606 dirtied
WAL usage: 1 records, 1 full page images, 937 bytes
system usage: CPU: user: 39.45 s, system: 20.74 s, elapsed: 60.19 s
```
This meant that all pages were surely scanned and dead tuples were removed.
However, when parallel was set to one, I got another result:
```
INFO: vacuuming "postgres.public.test"
INFO: launched 1 parallel vacuum worker for table scanning (planned: 1)
INFO: finished vacuuming "postgres.public.test": index scans: 0
pages: 0 removed, 2654868 remain, 1 scanned (0.00% of total)
tuples: 12 removed, 0 remain, 0 are dead but not yet removable
removable cutoff: 752, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 92.952 MB/s, avg write rate: 0.845 MB/s
buffer usage: 96 hits, 660 misses, 6 dirtied
WAL usage: 1 records, 1 full page images, 937 bytes
system usage: CPU: user: 0.05 s, system: 0.00 s, elapsed: 0.05 s
```
It looks like only one page was scanned and 12 tuples were removed,
which seems very strange to me...
The attached script emulates my test. IIUC it is almost the same as
yours, but the instance was restarted before vacuuming.
Can you reproduce this and see the reason? Based on your requirements
I can provide further information.
Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
On Fri, Jul 5, 2024 at 6:51 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Dear Sawada-san,
The parallel vacuum we have today supports only for index vacuuming.
Therefore, while multiple workers can work on different indexes in
parallel, the heap table is always processed by the single process.
I'd like to propose $subject, which enables us to have multiple
workers running on the single heap table. This would be helpful to
speedup vacuuming for tables without indexes or tables with
INDEX_CLENAUP = off.
Sounds great. IIUC, vacuuming is still one of the main weak point of postgres.
I've attached a PoC patch for this feature. It implements only
parallel heap scans in lazyvacum. We can extend this feature to
support parallel heap vacuum as well in the future or in the same
patch.
Before diving into deep, I tested your PoC but found unclear point.
When the vacuuming is requested with parallel > 0 with almost the same workload
as yours, only the first page was scanned and cleaned up.
When parallel was set to zero, I got:
```
INFO: vacuuming "postgres.public.test"
INFO: finished vacuuming "postgres.public.test": index scans: 0
pages: 0 removed, 2654868 remain, 2654868 scanned (100.00% of total)
tuples: 120000000 removed, 480000000 remain, 0 are dead but not yet removable
removable cutoff: 752, which was 0 XIDs old when operation ended
new relfrozenxid: 739, which is 1 XIDs ahead of previous value
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 344.639 MB/s, avg write rate: 344.650 MB/s
buffer usage: 2655045 hits, 2655527 misses, 2655606 dirtied
WAL usage: 1 records, 1 full page images, 937 bytes
system usage: CPU: user: 39.45 s, system: 20.74 s, elapsed: 60.19 s
```
This meant that all pages were surely scanned and dead tuples were removed.
However, when parallel was set to one, I got another result:
```
INFO: vacuuming "postgres.public.test"
INFO: launched 1 parallel vacuum worker for table scanning (planned: 1)
INFO: finished vacuuming "postgres.public.test": index scans: 0
pages: 0 removed, 2654868 remain, 1 scanned (0.00% of total)
tuples: 12 removed, 0 remain, 0 are dead but not yet removable
removable cutoff: 752, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 92.952 MB/s, avg write rate: 0.845 MB/s
buffer usage: 96 hits, 660 misses, 6 dirtied
WAL usage: 1 records, 1 full page images, 937 bytes
system usage: CPU: user: 0.05 s, system: 0.00 s, elapsed: 0.05 s
```
It looked like that only a page was scanned and 12 tuples were removed.
It looks very strange for me...
Attached script emulate my test. IIUC it was almost the same as yours, but
the instance was restarted before vacuuming.
Can you reproduce and see the reason? Based on the requirement I can
provide further information.
Thank you for the test!
I could reproduce this issue, and it's a bug: it skipped even
non-all-visible pages. I've attached a new version of the patch.
BTW, since we compute the number of parallel workers for the heap scan
based on the table size, it's possible that we launch multiple workers
even if most blocks are all-visible. It seems better to calculate it
based on (relpages - relallvisible).
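A hedged sketch of that calculation (the constants and the tripling rule are purely illustrative, loosely following the spirit of compute_parallel_worker(); none of this is the patch's actual code):

```c
#include <assert.h>
#include <stdint.h>

/* Pick a parallel degree from the number of not-all-visible pages:
 * one worker once a minimum threshold is reached, and one more for
 * each tripling beyond it, capped by max_workers (which would come
 * from max_parallel_maintenance_workers). */
static int
heap_scan_workers_sketch(uint32_t relpages, uint32_t relallvisible,
                         int max_workers)
{
    uint32_t pages = (relpages > relallvisible) ? relpages - relallvisible : 0;
    uint32_t threshold = 1024;  /* hypothetical minimum pages per worker */
    int workers = 0;

    while (pages >= threshold && workers < max_workers)
    {
        workers++;
        if (threshold > UINT32_MAX / 3)
            break;              /* avoid overflow on huge tables */
        threshold *= 3;
    }
    return workers;
}
```

With this rule, a mostly all-visible table earns few or no workers even if relpages is large, which is the point of using (relpages - relallvisible).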
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
parallel_heap_vacuum_scan_v2.patch
On Fri, Jun 28, 2024 at 9:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Jun 28, 2024 at 9:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
# Benchmark results
* Test-1: parallel heap scan on the table without indexes
I created 20GB table, made garbage on the table, and run vacuum while
changing parallel degree:
create unlogged table test (a int) with (autovacuum_enabled = off);
insert into test select generate_series(1, 600000000); --- 20GB table
delete from test where a % 5 = 0;
vacuum (verbose, parallel 0) test;
Here are the results (total time and heap scan time):
PARALLEL 0: 21.99 s (single process)
PARALLEL 1: 11.39 s
PARALLEL 2: 8.36 s
PARALLEL 3: 6.14 s
PARALLEL 4: 5.08 s
* Test-2: parallel heap scan on the table with one index
I used a similar table to the test case 1 but created one btree index on it:
create unlogged table test (a int) with (autovacuum_enabled = off);
insert into test select generate_series(1, 600000000); --- 20GB table
create index on test (a);
delete from test where a % 5 = 0;
vacuum (verbose, parallel 0) test;
I've measured the total execution time as well as the time of each
vacuum phase (from left heap scan time, index vacuum time, and heap
vacuum time):
PARALLEL 0: 45.11 s (21.89, 16.74, 6.48)
PARALLEL 1: 42.13 s (12.75, 22.04, 7.23)
PARALLEL 2: 39.27 s (8.93, 22.78, 7.45)
PARALLEL 3: 36.53 s (6.76, 22.00, 7.65)
PARALLEL 4: 35.84 s (5.85, 22.04, 7.83)
Overall, I can see the parallel heap scan in lazy vacuum has a decent
scalability; In both test-1 and test-2, the execution time of heap
scan got ~4x faster with 4 parallel workers. On the other hand, when
it comes to the total vacuum execution time, I could not see much
performance improvement in test-2 (45.11 vs. 35.84). Looking at the
results PARALLEL 0 vs. PARALLEL 1 in test-2, the heap scan got faster
(21.89 vs. 12.75) whereas index vacuum got slower (16.74 vs. 22.04),
and heap scan in case 2 was not as fast as in case 1 with 1 parallel
worker (12.75 vs. 11.39).
I think the reason is the shared TidStore is not very scalable since
we have a single lock on it. In all cases in the test-1, we don't use
the shared TidStore since all dead tuples are removed during heap
pruning. So the scalability was better overall than in test-2. In
parallel 0 case in test-2, we use the local TidStore, and from
parallel degree of 1 in test-2, we use the shared TidStore and
parallel worker concurrently update it. Also, I guess that the lookup
performance of the local TidStore is better than the shared TidStore's
lookup performance because of the differences between a bump context
and an DSA area. I think that this difference contributed the fact
that index vacuuming got slower (16.74 vs. 22.04).
Thank you for the comments!
There are two obvious improvement ideas to improve overall vacuum
execution time: (1) improve the shared TidStore scalability and (2)
support parallel heap vacuum. For (1), several ideas are proposed by
the ART authors[1]. I've not tried these ideas but it might be
applicable to our ART implementation. But I prefer to start with (2)
since it would be easier. Feedback is very welcome.
Starting with (2) sounds like a reasonable approach. We should study a
few more things like (a) the performance results where there are 3-4
indexes,
Here are the results with 4 indexes (and restarting the server before
the benchmark):
PARALLEL 0: 115.48 s (32.76, 64.46, 18.24)
PARALLEL 1: 74.88 s (17.11, 44.43, 13.25)
PARALLEL 2: 71.15 s (14.13, 44.82, 12.12)
PARALLEL 3: 46.78 s (10.74, 24.50, 11.43)
PARALLEL 4: 46.42 s (8.95, 24.96, 12.39) (launched 4 workers for heap
scan and 3 workers for index vacuum)
(b) What is the reason for performance improvement seen with
only heap scans. We normally get benefits of parallelism because of
using multiple CPUs but parallelizing scans (I/O) shouldn't give much
benefits. Is it possible that you are seeing benefits because most of
the data is either in shared_buffers or in memory? We can probably try
vacuuming tables by restarting the nodes to ensure the data is not in
memory.
I think it depends on the storage performance. FYI I use an EC2
instance (m6id.metal).
I've run the same benchmark script (table with no index) with
restarting the server before executing the vacuum, and here are the
results:
PARALLEL 0: 32.75 s
PARALLEL 1: 17.46 s
PARALLEL 2: 13.41 s
PARALLEL 3: 10.31 s
PARALLEL 4: 8.48 s
With the above two tests, I used the updated patch that I just submitted[1].
Regards,
[1]: /messages/by-id/CAD21AoAWHHnCg9OvtoEJnnvCc-3isyOyAGn+2KYoSXEv=vXauw@mail.gmail.com
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Dear Sawada-san,
Thank you for the test!
I could reproduce this issue and it's a bug; it skipped even
non-all-visible pages. I've attached the new version patch.
BTW since we compute the number of parallel workers for the heap scan
based on the table size, it's possible that we launch multiple workers
even if most blocks are all-visible. It seems to be better if we
calculate it based on (relpages - relallvisible).
Thanks for updating the patch. I applied it and confirmed all pages
are scanned. I used almost the same script (just changed
max_parallel_maintenance_workers) and got the results below. I think
the tendency is the same as yours.
```
parallel 0: 61114.369 ms
parallel 1: 34870.316 ms
parallel 2: 23598.115 ms
parallel 3: 17732.363 ms
parallel 4: 15203.271 ms
parallel 5: 13836.025 ms
```
I started to read your code, but it takes much time because I've never
seen it before... The part below contains my initial comments.
1.
This patch cannot be built when debug mode is enabled. See [1].
IIUC, this was because NewRelminMxid was moved from struct LVRelState
to PHVShared, so you should update it like "vacrel->counters->NewRelminMxid".
2.
I compared this with parallel heap scan and found that it does not
have a compute_worker API. Can you clarify the reason for this
inconsistency? (I feel it is intentional because the calculation logic
seems to depend on the heap structure; should we add the API for table
scan as well?)
[1]:
```
vacuumlazy.c: In function ‘lazy_scan_prune’:
vacuumlazy.c:1666:34: error: ‘LVRelState’ {aka ‘struct LVRelState’} has no member named ‘NewRelminMxid’
Assert(MultiXactIdIsValid(vacrel->NewRelminMxid));
^~
....
```
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Thu, Jul 25, 2024 at 2:58 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Dear Sawada-san,
Thank you for the test!
I could reproduce this issue and it's a bug; it skipped even
non-all-visible pages. I've attached the new version patch.
BTW since we compute the number of parallel workers for the heap scan
based on the table size, it's possible that we launch multiple workers
even if most blocks are all-visible. It seems to be better if we
calculate it based on (relpages - relallvisible).
Thanks for updating the patch. I applied and confirmed all pages are scanned.
I used almost the same script (just changed max_parallel_maintenance_workers)
and got below result. I think the tendency was the same as yours.
```
parallel 0: 61114.369 ms
parallel 1: 34870.316 ms
parallel 2: 23598.115 ms
parallel 3: 17732.363 ms
parallel 4: 15203.271 ms
parallel 5: 13836.025 ms
```
Thank you for testing!
I started to read your codes but takes much time because I've never seen before...
Below part contains initial comments.
1.
This patch cannot be built when debug mode is enabled. See [1].
IIUC, this was because NewRelminMxid was moved from struct LVRelState to PHVShared.
So you should update like " vacrel->counters->NewRelminMxid".
Right, will fix.
2.
I compared parallel heap scan and found that it does not have compute_worker API.
Can you clarify the reason why there is an inconsistency?
(I feel it is intentional because the calculation logic seems to depend on the heap structure,
so should we add the API for table scan as well?)
There is room to consider a better API design, but yes, the reason is
that the calculation logic depends on the table AM implementation. For
example, I thought it might make sense to take the number of
all-visible pages into account when calculating the number of parallel
workers, as we don't want to launch many workers on a table where most
pages are all-visible. That might not work for other table AMs.
I'm updating the patch to implement parallel heap vacuum and will
share the updated patch. It might take time, as it requires
implementing shared iteration support in the radix tree.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Dear Sawada-san,
Thank you for testing!
I tried to profile the vacuuming with a larger case (40 workers for
the 20GB table), and the attached FlameGraph shows the result. IIUC, I
cannot find any bottlenecks.
2.
I compared parallel heap scan and found that it does not have compute_worker API.
Can you clarify the reason why there is an inconsistency?
(I feel it is intentional because the calculation logic seems to depend on the heap structure,
so should we add the API for table scan as well?)
There is room to consider a better API design, but yes, the reason is
that the calculation logic depends on table AM implementation. For
example, I thought it might make sense to consider taking the number
of all-visible pages into account for the calculation of the number of
parallel workers as we don't want to launch many workers on the table
where most pages are all-visible. Which might not work for other table
AMs.
Okay, thanks for confirming. I wanted to ask others as well.
I'm updating the patch to implement parallel heap vacuum and will
share the updated patch. It might take time as it requires to
implement shared iteration support in radx tree.
Here are other preliminary comments for v2 patch. This does not contain
cosmetic ones.
1.
The shared data structure PHVShared does not contain a mutex lock. Is
this intentional because its fields are accessed by the leader only
after the parallel workers exit?
2.
Per my understanding, the vacuuming goes through the steps below.
a. parallel workers are launched for scanning pages
b. the leader waits until the scans are done
c. the leader does vacuum alone (you may extend here...)
d. parallel workers are launched again to clean up indexes
If so, can we reuse the parallel workers for the cleanup? Or is this
more engineering pain than benefit?
3.
According to LaunchParallelWorkers(), the bgw_name and bgw_type are
hardcoded as "parallel worker ...". Can we extend this to improve
trackability in pg_stat_activity?
4.
I'm not an expert on TidStore, but as you said, TidStoreLockExclusive()
might be a bottleneck when TIDs are added to the shared TidStore.
Another primitive idea of mine is to prepare per-worker TidStores (in
PHVScanWorkerState or LVRelCounters?) and gather them after the heap
scan. If you extend the patch so that parallel workers do the
vacuuming, the gathering may not be needed: they can access their own
TidStore and clean up. One downside is that the memory consumption may
be quite large.
What do you think?
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachments:
40_flamegraph.svg
Sorry for the very late reply.
On Tue, Jul 30, 2024 at 8:54 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Dear Sawada-san,
Thank you for testing!
I tried to profile the vacuuming with the larger case (40 workers for the 20G table)
and attached FlameGraph showed the result. IIUC, I cannot find bottlenecks.
2.
I compared parallel heap scan and found that it does not have compute_worker API.
Can you clarify the reason why there is an inconsistency?
(I feel it is intentional because the calculation logic seems to depend on the heap structure,
so should we add the API for table scan as well?)
There is room to consider a better API design, but yes, the reason is
that the calculation logic depends on table AM implementation. For
example, I thought it might make sense to consider taking the number
of all-visible pages into account for the calculation of the number of
parallel workers as we don't want to launch many workers on the table
where most pages are all-visible. Which might not work for other table
AMs.
Okay, thanks for confirming. I wanted to ask others as well.
I'm updating the patch to implement parallel heap vacuum and will
share the updated patch. It might take time as it requires to
implement shared iteration support in radx tree.
Here are other preliminary comments for v2 patch. This does not contain
cosmetic ones.
1.
Shared data structure PHVShared does not contain the mutex lock. Is it intentional
because they are accessed by leader only after parallel workers exit?
Yes, the fields in PHVShared are read-only for workers. Since no
concurrent reads/writes happen on these fields we don't need to
protect them.
2.
Per my understanding, the vacuuming goes like below steps.
a. paralell workers are launched for scanning pages
b. leader waits until scans are done
c. leader does vacuum alone (you may extend here...)
d. parallel workers are launched again to cleanup indeces
If so, can we reuse parallel workers for the cleanup? Or, this is painful
engineering than the benefit?
I've not thought of this idea, but I think it's possible from a
technical perspective. It saves the overhead of relaunching workers,
but I'm not sure how much it would help performance, and I'm concerned
it would make the code complex. For example, different numbers of
workers might be required for table vacuuming and index vacuuming, so
we would end up increasing or decreasing workers.
3.
According to LaunchParallelWorkers(), the bgw_name and bgw_type are hardcoded as
"parallel worker ..." Can we extend this to improve the trackability in the
pg_stat_activity?
It would be a good improvement for better trackability. But I think we
should do that in a separate patch, as it's not a problem only for
parallel heap vacuum.
4.
I'm not an expert on TidStore, but as you said, TidStoreLockExclusive() might be a
bottleneck when TIDs are added to the shared TidStore. Another primitive idea
is to prepare per-worker TidStores (in the PHVScanWorkerState or LVRelCounters?)
and gather them after the heap scan. If you extend the patch so that parallel workers
do the vacuuming as well, the gathering may not be needed: each worker can access
its own TidStore and clean up.
One downside is that the memory consumption may be quite large.
Interesting idea. Supposing we support parallel heap vacuum as well, we
wouldn't need locking or shared-iteration support on TidStore. I
think each worker should use a fraction of maintenance_work_mem.
However, one downside would be that we would need to check as many
TidStores as there are workers during index vacuuming.
FYI I've implemented the parallel heap vacuum part and am doing some
benchmark tests. I'll share the updated patches along with test
results this week.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Oct 22, 2024 at 4:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On further thoughts, I think this idea doesn't go well. Index
vacuuming is the most time-consuming phase among the vacuum phases, so I
think it would not be a good idea to make it slower even if we could do
the parallel heap scan and heap vacuum without any locking. Also, merging
multiple TidStores into one is not straightforward since the block ranges
that the workers process overlap.
Please find the attached patches. From the previous version, I made a
lot of changes including bug fixes, addressing review comments, and
adding parallel heap vacuum support. The parallel vacuum related
infrastructure is implemented in vacuumparallel.c, and vacuumlazy.c
now uses ParallelVacuumState for parallel heap scan/vacuum, index
bulkdelete/cleanup, or both. Parallel vacuum workers launch at the
beginning of each phase and exit at the end of each phase. Since
different numbers of workers could be used for heap scan/vacuum and
index bulkdelete/cleanup, it's possible that only one of heap
scan/vacuum and index bulkdelete/cleanup is parallelized.
In order to implement parallel heap vacuum, I extended radix tree and
tidstore to support shared iteration. The shared iteration works only
with a shared tidstore but a non-shared iteration works with a local
tidstore as well as a shared tidstore. For example, if a table is
large and has one index, we use only parallel heap scan/vacuum. In
this case, we store dead item TIDs into a shared tidstore during
parallel heap scan, but during parallel index bulk-deletion we perform
a non-shared iteration on the shared tidstore, which is more efficient
as it doesn't acquire any locks during the iteration.
I've done benchmark tests with a 10GB unlogged table (created in a
tmpfs tablespace) having 4 btree indexes while changing the parallel
degree. I restarted the postgres server before each run to ensure that
the data is not in shared memory, and I avoided disk I/O during lazy
vacuum as much as possible. Here is a comparison between HEAD and
patched (median of 5 runs):
+----------+-----------+-----------+-------------+
| parallel | HEAD | patched | improvement |
+----------+-----------+-----------+-------------+
| 0 | 53079.53 | 53468.734 | 1.007 |
| 1 | 48101.46 | 35712.613 | 0.742 |
| 2 | 37767.902 | 23566.426 | 0.624 |
| 4 | 38005.836 | 20192.055 | 0.531 |
| 8 | 37754.47 | 18614.717 | 0.493 |
+----------+-----------+-----------+-------------+
Here are the breakdowns of the execution time of each vacuum phase (from
left: heap scan, index bulkdel, heap vacuum):
- HEAD
parallel 0: 53079.530 (15886, 28039, 9270)
parallel 1: 48101.460 (15931, 23247, 9215)
parallel 2: 37767.902 (15259, 12888, 9479)
parallel 4: 38005.836 (16097, 12683, 9217)
parallel 8: 37754.470 (16016, 12535, 9306)
- Patched
parallel 0: 53468.734 (15990, 28296, 9465)
parallel 1: 35712.613 ( 8254, 23569, 3700)
parallel 2: 23566.426 ( 6180, 12760, 3283)
parallel 4: 20192.055 ( 4058, 12776, 2154)
parallel 8: 18614.717 ( 2797, 13244, 1579)
The index bulkdel phase saturates at parallel 2 since one worker is
assigned per index. On HEAD, there is no further performance gain
beyond 'parallel 4'. On the other hand, the patched version kept getting
faster at 'parallel 4' and 'parallel 8' since the other two phases
were also done with parallel workers.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v3-0004-Support-parallel-heap-vacuum-during-lazy-vacuum.patch (+106, -52)
v3-0003-Support-shared-itereation-on-TidStore.patch (+62, -1)
v3-0001-Support-parallel-heap-scan-during-lazy-vacuum.patch (+1313, -266)
v3-0002-raidxtree.h-support-shared-iteration.patch (+190, -32)
On Fri, Oct 25, 2024 at 12:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've attached new version patches that fix the failures reported by
cfbot. I hope these changes make cfbot happy.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v4-0004-Support-parallel-heap-vacuum-during-lazy-vacuum.patch (+122, -54)
v4-0001-Support-parallel-heap-scan-during-lazy-vacuum.patch (+1318, -266)
v4-0003-Support-shared-itereation-on-TidStore.patch (+62, -1)
v4-0002-raidxtree.h-support-shared-iteration.patch (+190, -32)
Dear Sawada-san,
I've attached new version patches that fix the failures reported by
cfbot. I hope these changes make cfbot happy.
Thanks for updating the patch, and sorry for the delayed reply. I confirmed that cfbot
for Linux/Windows says ok.
I'm still learning the feature so I can post only one comment :-(.
I wanted to know whether TidStoreBeginIterateShared() is needed. IIUC, the pre-existing API,
TidStoreBeginIterate(), already accepts a shared TidStore. The only difference
is whether the elog(ERROR) exists, but I wonder if it benefits others. Is there another
reason that lazy_vacuum_heap_rel() uses TidStoreBeginIterateShared()?
Another approach is to restrict TidStoreBeginIterate() to support only a local TidStore.
What do you think?
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Mon, Nov 11, 2024 at 5:08 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Dear Sawada-san,
TidStoreBeginIterateShared() is designed for multiple parallel workers
to iterate a shared TidStore. During an iteration, parallel workers
share the iteration state and iterate the underlying radix tree while
taking appropriate locks. Therefore, it's available only for a shared
TidStore. This is required to implement the parallel heap vacuum,
where multiple parallel workers do the iteration on the shared
TidStore.
On the other hand, TidStoreBeginIterate() is designed for a single
process to iterate a TidStore. It accepts even a shared TidStore as
you mentioned, but during an iteration there is no inter-process
coordination such as locking. When it comes to parallel vacuum,
supporting TidStoreBeginIterate() on a shared TidStore is necessary to
cover the case where we use only parallel index vacuum but not
parallel heap scan/vacuum. In this case, we need to store dead tuple
TIDs on the shared TidStore during heap scan so parallel workers can
use it during index vacuum. But it's not necessary to use
TidStoreBeginIterateShared() because only one (leader) process does
heap vacuum.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, 30 Oct 2024 at 22:48, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've attached new version patches that fix the failures reported by
cfbot. I hope these changes make cfbot happy.
I just started reviewing the patch and have the following comments
so far:
1) I felt we should add some documentation for this at [1].
2) Can we add some tests in vacuum_parallel with tables having no
indexes and having dead tuples?
3) This should be included in typedefs.list:
3.a)
+/*
+ * Relation statistics collected during heap scanning and need to be shared among
+ * parallel vacuum workers.
+ */
+typedef struct LVRelScanStats
+{
+    BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
+    BlockNumber removed_pages; /* # pages removed by relation truncation */
+    BlockNumber frozen_pages;  /* # pages with newly frozen tuples */
3.b) Similarly this too:
+/*
+ * Struct for information that need to be shared among parallel vacuum workers
+ */
+typedef struct PHVShared
+{
+ bool aggressive;
+ bool skipwithvm;
+
3.c) Similarly this too:
+/* Per-worker scan state for parallel heap vacuum scan */
+typedef struct PHVScanWorkerState
+{
+ bool initialized;
3.d) Similarly this too:
+/* Struct for parallel heap vacuum */
+typedef struct PHVState
+{
+ /* Parallel scan description shared among parallel workers */
4) Since we are initializing almost all the members of the structure,
should we use palloc0 in this case:
+ scan_stats = palloc(sizeof(LVRelScanStats));
+ scan_stats->scanned_pages = 0;
+ scan_stats->removed_pages = 0;
+ scan_stats->frozen_pages = 0;
+ scan_stats->lpdead_item_pages = 0;
+ scan_stats->missed_dead_pages = 0;
+ scan_stats->nonempty_pages = 0;
+
+ /* Initialize remaining counters (be tidy) */
+ scan_stats->tuples_deleted = 0;
+ scan_stats->tuples_frozen = 0;
+ scan_stats->lpdead_items = 0;
+ scan_stats->live_tuples = 0;
+ scan_stats->recently_dead_tuples = 0;
+ scan_stats->missed_dead_tuples = 0;
5) typo paralle should be parallel
+/*
+ * Return the number of parallel workers for a parallel vacuum scan of this
+ * relation.
+ */
+static inline int
+table_paralle_vacuum_compute_workers(Relation rel, int requested)
+{
+ return rel->rd_tableam->parallel_vacuum_compute_workers(rel, requested);
+}
[1]: https://www.postgresql.org/docs/devel/sql-vacuum.html
Regards,
Vignesh
Dear Sawada-san,
Okay, thanks for the description. I felt it is OK to keep.
I read 0001 again and here are comments.
01. vacuumlazy.c
```
+#define LV_PARALLEL_SCAN_SHARED 0xFFFF0001
+#define LV_PARALLEL_SCAN_DESC 0xFFFF0002
+#define LV_PARALLEL_SCAN_DESC_WORKER 0xFFFF0003
```
I checked other DSM keys used for parallel work, and they seem to have names
like PARALLEL_KEY_XXX. Can we follow that?
02. LVRelState
```
+ BlockNumber next_fsm_block_to_vacuum;
```
Only this attribute does not have a comment. Can we add one like
"Next freespace map page to be checked"?
03. parallel_heap_vacuum_gather_scan_stats
```
+ vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages;
```
Note that `scan_stats->vacuumed_pages` does not exist in 0001; it is defined
in 0004. Can you move it?
04. heap_parallel_vacuum_estimate
```
+
+ heap_parallel_estimate_shared_memory_size(rel, nworkers, &pscan_len,
+ &shared_len, &pscanwork_len);
+
+ /* space for PHVShared */
+ shm_toc_estimate_chunk(&pcxt->estimator, shared_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+ /* space for ParallelBlockTableScanDesc */
+ pscan_len = table_block_parallelscan_estimate(rel);
+ shm_toc_estimate_chunk(&pcxt->estimator, pscan_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+ /* space for per-worker scan state, PHVScanWorkerState */
+ pscanwork_len = mul_size(sizeof(PHVScanWorkerState), nworkers);
+ shm_toc_estimate_chunk(&pcxt->estimator, pscanwork_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
```
I feel pscan_len and pscanwork_len are already calculated in heap_parallel_estimate_shared_memory_size().
Can we remove the table_block_parallelscan_estimate() and mul_size() calls from here?
05. Idea
Can you update documentations?
06. Idea
AFAICS pg_stat_progress_vacuum does not contain information related to
parallel execution. What do you think about adding an attribute which shows a list of PIDs?
I'm not sure it is helpful for users, but it could show the parallelism.
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Wed, Nov 13, 2024 at 3:10 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Dear Sawada-san,
Thank you for the review comments!
01. vacuumlazy.c
```
+#define LV_PARALLEL_SCAN_SHARED 0xFFFF0001
+#define LV_PARALLEL_SCAN_DESC 0xFFFF0002
+#define LV_PARALLEL_SCAN_DESC_WORKER 0xFFFF0003
```
I checked other DSM keys used for parallel work, and they seem to have names
like PARALLEL_KEY_XXX. Can we follow that?
Yes. How about LV_PARALLEL_KEY_XXX?
02. LVRelState
```
+ BlockNumber next_fsm_block_to_vacuum;
```
Only this attribute does not have a comment. Can we add one like
"Next freespace map page to be checked"?
Agreed. I'll add a comment "next block to check for FSM vacuum".
03. parallel_heap_vacuum_gather_scan_stats
```
+ vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages;
```
Note that `scan_stats->vacuumed_pages` does not exist in 0001; it is defined
in 0004. Can you move it?
Will remove.
04. heap_parallel_vacuum_estimate
```
+
+    heap_parallel_estimate_shared_memory_size(rel, nworkers, &pscan_len,
+                                              &shared_len, &pscanwork_len);
+
+    /* space for PHVShared */
+    shm_toc_estimate_chunk(&pcxt->estimator, shared_len);
+    shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+    /* space for ParallelBlockTableScanDesc */
+    pscan_len = table_block_parallelscan_estimate(rel);
+    shm_toc_estimate_chunk(&pcxt->estimator, pscan_len);
+    shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+    /* space for per-worker scan state, PHVScanWorkerState */
+    pscanwork_len = mul_size(sizeof(PHVScanWorkerState), nworkers);
+    shm_toc_estimate_chunk(&pcxt->estimator, pscanwork_len);
+    shm_toc_estimate_keys(&pcxt->estimator, 1);
```
I feel pscan_len and pscanwork_len are already calculated in heap_parallel_estimate_shared_memory_size().
Can we remove the table_block_parallelscan_estimate() and mul_size() calls from here?
Yes, it's an oversight. Will remove.
05. Idea
Can you update documentations?
Will update the doc as well.
06. Idea
AFAICS pg_stat_progress_vacuum does not contain information related with the
parallel execution. How do you think adding an attribute which shows a list of pids?
Not sure it is helpful for users but it can show the parallelism.
I think it's possible to show the parallelism even today (for parallel
index vacuuming):
=# select leader.pid, leader.query, array_agg(worker.pid) from
pg_stat_activity as leader, pg_stat_activity as worker,
pg_stat_progress_vacuum as v where leader.pid = worker.leader_pid and
leader.pid = v.pid group by 1, 2;
pid | query | array_agg
---------+---------------------+-------------------
2952103 | vacuum (verbose) t; | {2952257,2952258}
(1 row)
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Nov 12, 2024 at 3:21 AM vignesh C <vignesh21@gmail.com> wrote:
Thank you for the comments! I'll address these comments in the next
version patch.
BTW while updating the patch, I found that we might want to launch
different numbers of workers for scanning heap and vacuuming heap. The
number of parallel workers is determined based on the number of blocks
in the table. However, even if this number is high, it could happen
that we want to launch fewer workers to vacuum heap pages when there
are not many pages having garbage. And the number of workers for
vacuuming heap could vary on each vacuum pass. I'm considering
implementing it.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Dear Sawada-san,
Just to clarify - this idea looks good to me. I imagine you will add new table AM APIs
like parallel_vacuum_compute_workers_for_scanning and parallel_vacuum_compute_workers_for_vacuuming.
If other table AM developers want to use the same number of workers as for scanning,
they can pass the same function to the pointer. Is that right?
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Hi Sawada-San,
FYI, the patch 0001 fails to build stand-alone
vacuumlazy.c: In function ‘parallel_heap_vacuum_gather_scan_stats’:
vacuumlazy.c:3739:21: error: ‘LVRelScanStats’ has no member named
‘vacuumed_pages’
vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages;
^
vacuumlazy.c:3739:43: error: ‘LVRelScanStats’ has no member named
‘vacuumed_pages’
vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages;
^
make[4]: *** [vacuumlazy.o] Error 1
It appears to be using a struct field which is not even introduced
until the patch 0004 of the patch set.
======
Kind Regards,
Peter Smith.
Fujitsu Australia.
Hi Sawada-San,
I started to look at patch v4-0001 in this thread.
It is quite a big patch so this is a WIP, and these below are just the
comments I have so far.
======
src/backend/access/heap/vacuumlazy.c
1.1.
+/*
+ * Relation statistics collected during heap scanning and need to be
shared among
+ * parallel vacuum workers.
+ */
+typedef struct LVRelScanStats
The comment wording is not quite right.
/Relation statistics collected during heap scanning/Relation
statistics that are collected during heap scanning/
~~~
1.2
+/*
+ * Struct for information that need to be shared among parallel vacuum workers
+ */
+typedef struct PHVShared
The comment wording is not quite right.
/that need to be shared/that needs to be shared/
~~~
1.3.
+/* Struct for parallel heap vacuum */
+typedef struct PHVState
+{
+ /* Parallel scan description shared among parallel workers */
+ ParallelBlockTableScanDesc pscandesc;
+
+ /* Shared information */
+ PHVShared *shared;
If 'pscandesc' is described as 'shared among parallel workers', should
that field be within 'PHVShared' instead?
~~~
1.4.
/* Initialize page counters explicitly (be tidy) */
- vacrel->scanned_pages = 0;
- vacrel->removed_pages = 0;
- vacrel->frozen_pages = 0;
- vacrel->lpdead_item_pages = 0;
- vacrel->missed_dead_pages = 0;
- vacrel->nonempty_pages = 0;
- /* dead_items_alloc allocates vacrel->dead_items later on */
+ scan_stats = palloc(sizeof(LVRelScanStats));
+ scan_stats->scanned_pages = 0;
+ scan_stats->removed_pages = 0;
+ scan_stats->frozen_pages = 0;
+ scan_stats->lpdead_item_pages = 0;
+ scan_stats->missed_dead_pages = 0;
+ scan_stats->nonempty_pages = 0;
+
+ /* Initialize remaining counters (be tidy) */
+ scan_stats->tuples_deleted = 0;
+ scan_stats->tuples_frozen = 0;
+ scan_stats->lpdead_items = 0;
+ scan_stats->live_tuples = 0;
+ scan_stats->recently_dead_tuples = 0;
+ scan_stats->missed_dead_tuples = 0;
+
+ vacrel->scan_stats = scan_stats;
1.4a.
Or, maybe just palloc0 this and provide a comment to say all counters
have been zapped to 0.
~
1.4b.
Maybe you don't need that 'scan_stats' variable; just assign the
palloc0 directly to the field instead.
~~~
1.5.
- vacrel->missed_dead_tuples = 0;
+ /* dead_items_alloc allocates vacrel->dead_items later on */
The patch seems to have moved this "dead_items_alloc" comment to now
be below the "Allocate/initialize output statistics state" stuff (??).
======
src/backend/commands/vacuumparallel.c
parallel_vacuum_init:
1.6.
int parallel_workers = 0;
+ int nworkers_table;
+ int nworkers_index;
The local vars and function params are named like this (here and in
other functions). But the struct field names say 'nworkers_for_XXX'
e.g.
shared->nworkers_for_table = nworkers_table;
shared->nworkers_for_index = nworkers_index;
It may be better to use these 'nworkers_for_table' and
'nworkers_for_index' names consistently everywhere.
~~~
parallel_vacuum_compute_workers:
1.7.
- int parallel_workers;
+ int parallel_workers_table = 0;
+ int parallel_workers_index = 0;
+
+ *nworkers_table = 0;
+ *nworkers_index = 0;
The local variables 'parallel_workers_table' and
'parallel_workers_index' are hardly needed because AFAICT those
results can be assigned directly to *nworkers_table and
*nworkers_index.
~~~
parallel_vacuum_process_all_indexes:
1.8.
/* Reinitialize parallel context to relaunch parallel workers */
- if (num_index_scans > 0)
+ if (num_index_scans > 0 || pvs->num_table_scans > 0)
ReinitializeParallelDSM(pvs->pcxt);
I don't know if it is feasible or even makes sense to change, but
somehow it seemed strange that the 'num_index_scans' field is not
co-located with the 'num_table_scans' in the ParallelVacuumState. If
this is doable, then lots of functions also would no longer need to
pass 'num_index_scans' since they are already passing 'pvs'.
~~~
parallel_vacuum_table_scan_begin:
1.9.
+ (errmsg(ngettext("launched %d parallel vacuum worker for table
processing (planned: %d)",
+ "launched %d parallel vacuum workers for table processing (planned: %d)",
+ pvs->pcxt->nworkers_launched),
Isn't this the same as errmsg_plural?
~~~
1.10.
+/* Return the array of indexes associated to the given table to be vacuumed */
+Relation *
+parallel_vacuum_get_table_indexes(ParallelVacuumState *pvs, int *nindexes)
Even though the function comment can fit on one line it is nicer to
use a block-style comment with a period, like below. It then will be
consistent with other function comments (e.g.
parallel_vacuum_table_scan_end, parallel_vacuum_process_table, etc).
There are multiple places that this review comment can apply to.
(also typo /associated to/associated with/)
SUGGESTION
/*
* Return the array of indexes associated with the given table to be vacuumed.
*/
~~~
parallel_vacuum_get_nworkers_table:
parallel_vacuum_get_nworkers_index:
1.11.
+/* Return the number of workers for parallel table vacuum */
+int
+parallel_vacuum_get_nworkers_table(ParallelVacuumState *pvs)
+{
+ return pvs->shared->nworkers_for_table;
+}
+
+/* Return the number of workers for parallel index processing */
+int
+parallel_vacuum_get_nworkers_index(ParallelVacuumState *pvs)
+{
+ return pvs->shared->nworkers_for_index;
+}
+
Are these functions needed? AFAICT, they are called only from macros
where it seems just as easy to reference the pvs fields directly.
~~~
parallel_vacuum_process_table:
1.12.
+/*
+ * A parallel worker invokes table-AM specified vacuum scan callback.
+ */
+static void
+parallel_vacuum_process_table(ParallelVacuumState *pvs)
+{
+ Assert(VacuumActiveNWorkers);
Maybe here also we should Assert(pvs.shared->do_vacuum_table_scan);
~~~
1.13.
- /* Process indexes to perform vacuum/cleanup */
- parallel_vacuum_process_safe_indexes(&pvs);
+ if (pvs.shared->do_vacuum_table_scan)
+ {
+ parallel_vacuum_process_table(&pvs);
+ }
+ else
+ {
+ ErrorContextCallback errcallback;
+
+ /* Setup error traceback support for ereport() */
+ errcallback.callback = parallel_vacuum_error_callback;
+ errcallback.arg = &pvs;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* Process indexes to perform vacuum/cleanup */
+ parallel_vacuum_process_safe_indexes(&pvs);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+ }
There are still some functions following this code (like
'shm_toc_lookup') that could potentially raise ERRORs. But, now the
error_context_stack is getting assigned/reset earlier than previously
was the case. Is that going to be a potential problem?
======
src/include/access/tableam.h
1.14.
+ /*
+ * Compute the amount of DSM space AM need in the parallel table vacuum.
+ *
Maybe reword this comment to be more like table_parallelscan_estimate.
SUGGESTION
Estimate the size of shared memory that the parallel table vacuum needs for AM.
~~~
1.15.
+/*
+ * Estimate the size of shared memory needed for a parallel vacuum scan of this
+ * of this relation.
+ */
+static inline void
+table_parallel_vacuum_estimate(Relation rel, ParallelContext *pcxt,
int nworkers,
+ void *state)
+{
+ rel->rd_tableam->parallel_vacuum_estimate(rel, pcxt, nworkers, state);
+}
+
+/*
+ * Initialize shared memory area for a parallel vacuum scan of this relation.
+ */
+static inline void
+table_parallel_vacuum_initialize(Relation rel, ParallelContext *pcxt,
int nworkers,
+ void *state)
+{
+ rel->rd_tableam->parallel_vacuum_initialize(rel, pcxt, nworkers, state);
+}
+
+/*
+ * Start a parallel vacuum scan of this relation.
+ */
+static inline void
+table_parallel_vacuum_scan(Relation rel, ParallelVacuumState *pvs,
+ ParallelWorkerContext *pwcxt)
+{
+ rel->rd_tableam->parallel_vacuum_scan_worker(rel, pvs, pwcxt);
+}
+
All of the "Callbacks for parallel table vacuum." had comments saying
"Not called if parallel table vacuum is disabled.". So, IIUC that
means all of these table_parallel_vacuum_XXX functions (other than the
compute_workers one) could have Assert(nworkers > 0); just to
double-check that is true.
~~~
table_paralle_vacuum_compute_workers:
1.16.
+static inline int
+table_paralle_vacuum_compute_workers(Relation rel, int requested)
+{
+ return rel->rd_tableam->parallel_vacuum_compute_workers(rel, requested);
+}
Typo in function name. /paralle/parallel/
======
Kind Regards,
Peter Smith.
Fujitsu Australia