Add heap and index vacuum timings to pg_stat_progress_vacuum
Hi,
When vacuum is running for hours on a table with many indexes, we can see
what phase it is in and how many index vacuum cycles have happened
(index_vacuum_count), but not how much time each phase has taken. That
makes it hard to tell what is actually slow.
The attached patch adds three columns to pg_stat_progress_vacuum that
report the total time spent in heap vacuuming, index vacuuming, and index
cleanup. The values are accumulated across overflow cycles when the dead
TID store fills up.
These columns make it easy to see whether the bottleneck is heap scanning
or index processing, analyze why vacuum is slow, and adjust
maintenance_work_mem accordingly to reduce TID store overflow cycles. They
also make it possible to estimate how long
vacuum takes to finish - polling these timing columns along with scanned
blocks gives a good estimated time.
We could put this behind a track_vacuum_timing GUC similar to
track_io_timing and track_wal_timing of the time capturing calls
seem costly. I'd prefer not to add another GUC, but I'm open to thoughts on
this.
For example, I tested with a 3M-row table (3.9 GB), 4 btree indexes, and
maintenance_work_mem=256kB (kept low to force TID-store overflows). About
34 minutes in, only 13% through the vacuum [1]SELECT * FROM pg_stat_progress_vacuum; pid | 18764 relid | 16384 phase | vacuuming indexes heap_blks_total | 500000 heap_blks_scanned | 66202 heap_blks_vacuumed | 66202 index_vacuum_count | 13188 max_dead_tuple_bytes | 262144 dead_tuple_bytes | 0 num_dead_item_ids | 0 indexes_total | 4 indexes_processed | 4 delay_time | 0 mode | manual started_by | user heap_vacuum_time | 2.098345 index_vacuum_time | 2005.931383 index_cleanup_time | 0. index_vacuum_time is 2006s
vs heap_vacuum_time of 2.1s - index processing is where all the time goes.
13,188 overflow cycles × 4 indexes = 52,752 index scans so far. In
production, tables with more indexes, these columns show exactly where
vacuum time is going and whether raising maintenance_work_mem would help.
With the help of the attached sql function, the estimated time to finish
the vacuum is about 3.5 hours [2]SELECT * FROM estimate_vacuum_completion(); pid | 18764 relname | vac_timing_test pct_done | 13.24 time_remaining | 03:35:39.
Thoughts?
[1]: SELECT * FROM pg_stat_progress_vacuum; pid | 18764 relid | 16384 phase | vacuuming indexes heap_blks_total | 500000 heap_blks_scanned | 66202 heap_blks_vacuumed | 66202 index_vacuum_count | 13188 max_dead_tuple_bytes | 262144 dead_tuple_bytes | 0 num_dead_item_ids | 0 indexes_total | 4 indexes_processed | 4 delay_time | 0 mode | manual started_by | user heap_vacuum_time | 2.098345 index_vacuum_time | 2005.931383 index_cleanup_time | 0
SELECT * FROM pg_stat_progress_vacuum;
pid | 18764
relid | 16384
phase | vacuuming indexes
heap_blks_total | 500000
heap_blks_scanned | 66202
heap_blks_vacuumed | 66202
index_vacuum_count | 13188
max_dead_tuple_bytes | 262144
dead_tuple_bytes | 0
num_dead_item_ids | 0
indexes_total | 4
indexes_processed | 4
delay_time | 0
mode | manual
started_by | user
heap_vacuum_time | 2.098345
index_vacuum_time | 2005.931383
index_cleanup_time | 0
[2]: SELECT * FROM estimate_vacuum_completion(); pid | 18764 relname | vac_timing_test pct_done | 13.24 time_remaining | 03:35:39
SELECT * FROM estimate_vacuum_completion();
pid | 18764
relname | vac_timing_test
pct_done | 13.24
time_remaining | 03:35:39
--
Bharath Rupireddy
Amazon Web Services: https://aws.amazon.com