Add index scan progress to pg_stat_progress_vacuum

Started by Sami Imseihover 4 years ago138 messageshackers

samimseih@gmail.com

over 4 years ago

The current implementation of pg_stat_progress_vacuum does not provide progress on which index is being vacuumed making it difficult for a user to determine if the "vacuuming indexes" phase is making progress. By exposing which index is being scanned as well as the total progress the scan has made for the current cycle, a user can make better estimations on when the vacuum will complete.

The proposed patch adds 4 new columns to pg_stat_progress_vacuum:

1. indrelid - the relid of the index being vacuumed
2. index_blks_total - total number of blocks to be scanned in the current cycle
3. index_blks_scanned - number of blocks scanned in the current cycle
4. leader_pid - if the pid for the pg_stat_progress_vacuum entry is a leader or a vacuum worker. This patch places an entry for every worker pid ( if parallel ) as well as the leader pid

Attached is the patch.

Here is a sample output of a parallel vacuum for table with relid = 16638

Regards,

Sami Imseih
Database Engineer @ Amazon Web Services

Nathan Bossart

nathandbossart@gmail.com

over 4 years ago

In reply to: Sami Imseih (#1)

Re: Add index scan progress to pg_stat_progress_vacuum

On 12/1/21, 3:02 PM, "Imseih (AWS), Sami" <simseih@amazon.com> wrote:

The current implementation of pg_stat_progress_vacuum does not
provide progress on which index is being vacuumed making it
difficult for a user to determine if the "vacuuming indexes" phase
is making progress. By exposing which index is being scanned as well
as the total progress the scan has made for the current cycle, a
user can make better estimations on when the vacuum will complete.

The proposed patch adds 4 new columns to pg_stat_progress_vacuum:

1. indrelid - the relid of the index being vacuumed
2. index_blks_total - total number of blocks to be scanned in the
current cycle
3. index_blks_scanned - number of blocks scanned in the current
cycle
4. leader_pid - if the pid for the pg_stat_progress_vacuum entry is
a leader or a vacuum worker. This patch places an entry for every
worker pid ( if parallel ) as well as the leader pid

nitpick: Shouldn't index_blks_scanned be index_blks_vacuumed? IMO it
is more analogous to heap_blks_vacuumed.

This will tell us which indexes are currently being vacuumed and the
current progress of those operations, but it doesn't tell us which
indexes have already been vacuumed or which ones are pending vacuum.
I think such information is necessary to truly understand the current
progress of vacuuming indexes, and I can think of a couple of ways we
might provide it:

1. Make the new columns you've proposed return arrays. This isn't
very clean, but it would keep all the information for a given
vacuum operation in a single row. The indrelids column would be
populated with all the indexes that have been vacuumed, need to
be vacuumed, or are presently being vacuumed. The other index-
related columns would then have the associated stats and the
worker PID (which might be the same as the pid column depending
on whether parallel index vacuum was being done). Alternatively,
the index column could have an array of records, each containing
all the information for a given index.
2. Create a new view for just index vacuum progress information.
This would have similar information as 1. There would be an
entry for each index that has been vacuumed, needs to be
vacuumed, or is currently being vacuumed. And there would be an
easy way to join with pg_stat_progress_vacuum (e.g., leader_pid,
which again might be the same as our index vacuum PID depending
on whether we were doing parallel index vacuum). Note that it
would be possible for the PID of these entries to be null before
and after we process the index.
3. Instead of adding columns to pg_stat_progress_vacuum, adjust the
current ones to be more general, and then add new entries for
each of the indexes that have been, need to be, or currently are
being vacuumed. This is the most similar option to your current
proposal, but instead of introducing a column like
index_blks_total, we'd rename heap_blks_total to blks_total and
use that for both the heap and indexes. I think we'd still want
to add a leader_pid column. Again, we have to be prepared for
the PID to be null in this case. Or we could just make the pid
column always refer to the leader, and we could introduce a
worker_pid column. That might create confusion, though.

I wish option #1 was cleaner, because I think it would be really nice
to have all this information in a single row. However, I don't expect
much support for a 3-dimensional view, so I suspect option #2
(creating a separate view for index vacuum progress) is the way to go.
The other benefit of option #2 versus option #3 or your original
proposal is that it cleanly separates the top-level vacuum operations
and the index vacuum operations, which are related at the moment, but
which might not always be tied so closely together.

Nathan

Add index scan progress to pg_stat_progress_vacuum

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments: