Parallel heap vacuum
Hi all,
The parallel vacuum we have today supports only index vacuuming.
Therefore, while multiple workers can work on different indexes in
parallel, the heap table is always processed by a single process.
I'd like to propose $subject, which enables us to have multiple
workers running on a single heap table. This would be helpful to
speed up vacuuming for tables without indexes or tables with
INDEX_CLEANUP = off.
I've attached a PoC patch for this feature. It implements only
parallel heap scans in lazy vacuum. We can extend this feature to
support parallel heap vacuum as well, either in the future or in the
same patch.
# Overall idea (for parallel heap scan in lazy vacuum)
At the beginning of vacuum, we determine how many workers to launch
based on the table size, like other parallel query operations. The
number of workers is capped by max_parallel_maintenance_workers. Once
we decide to use parallel heap scan, we prepare a DSM segment to share
data among the parallel workers and the leader. The shared information
includes at least the vacuum options such as aggressive, the counters
collected during lazy vacuum such as scanned_pages, vacuum cutoffs such
as VacuumCutoffs and GlobalVisState, and the parallel scan description.
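To make this concrete, here is an abridged version of the shared
struct as the PoC patch defines it (PHVShared in vacuumlazy.c; a few
fields are omitted here):

typedef struct PHVShared
{
    bool        aggressive;
    bool        skipwithvm;

    /* VACUUM operation's cutoffs for freezing and pruning */
    struct VacuumCutoffs cutoffs;
    GlobalVisState vistest;

    /* per-worker counters such as scanned_pages, tuples_deleted, etc. */
    LVRelCounters worker_relcnts[FLEXIBLE_ARRAY_MEMBER];
} PHVShared;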
Before starting the heap scan in lazy vacuum, we launch parallel
workers, and then each worker (and the leader) processes different
blocks. Each worker does HOT-pruning on pages and collects dead tuple
TIDs. When adding dead tuple TIDs, workers need to hold an exclusive
lock on the TidStore. At the end of the heap scan phase, workers exit
and the leader waits for all workers to exit. After that, the leader
process gathers the counters collected by the parallel workers and
computes the oldest relfrozenxid (and relminmxid). Then, if parallel
index vacuum is also enabled, we launch other parallel workers for
parallel index vacuuming.
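In rough pseudocode, the leader's flow in the patch looks like this
(see do_parallel_lazy_scan_heap(); the memory-limit loop and error
handling are omitted):

/* launch workers; the leader participates in the scan as well */
vacrel->phvstate->nworkers_launched =
    parallel_vacuum_table_scan_begin(vacrel->pvs);
(void) do_lazy_scan_heap(vacrel);

/*
 * Wait for all workers to exit, then gather their counters and
 * compute the oldest relfrozenxid/relminmxid.
 */
parallel_vacuum_table_scan_end(vacrel->pvs);
parallel_heap_vacuum_gather_scan_stats(vacrel);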
When it comes to the parallel heap scan in lazy vacuum, I think we can
use the table_block_parallelscan_XXX() family. One tricky thing we need
to deal with is that if the TidStore memory usage reaches the limit, we
stop the parallel scan, do index vacuum and table vacuum, and then
resume the parallel scan from the previous state. In order to do that,
in the patch, we store ParallelBlockTableScanWorker, the per-worker
parallel scan state, in DSM so that different parallel workers can
resume the scan using the same parallel scan state.
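Concretely, the patch wraps the per-worker scan state in a small struct
stored in DSM, and a process initializes its start block only on the
first round; later rounds simply continue from the saved state via
table_block_parallelscan_nextpage():

/* Per-worker scan state, kept in DSM so the scan can be resumed */
typedef struct PHVScanWorkerState
{
    ParallelBlockTableScanWorkerData state;
    bool        maybe_have_blocks;  /* chunk may have unscanned blocks */
} PHVScanWorkerState;

/* in the worker entry point: initialize the start block only once */
if (!phvstate->myscanstate->maybe_have_blocks)
    table_block_parallelscan_startblock_init(rel,
                                             &(phvstate->myscanstate->state),
                                             phvstate->pscandesc);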
In addition to that, since we could end up launching fewer workers
than requested, it could happen that some ParallelBlockTableScanWorker
data is used once and then never used again even though unprocessed
blocks remain. To handle this case, in the patch, the leader process
checks at the end of the parallel scan whether there is an incomplete
parallel scan. If so, the leader process does the scan using the
workers' ParallelBlockTableScanWorker data on their behalf.
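The leader-side loop for this in the patch
(parallel_heap_complete_unfinished_scan()) simply adopts each worker's
saved scan state and drives the remaining blocks itself:

for (int i = 0; i < nworkers; i++)
{
    PHVScanWorkerState *wstate = &(vacrel->phvstate->scanstates[i]);

    if (!wstate->maybe_have_blocks)
        continue;               /* this chunk was fully scanned */

    /* adopt the worker's scan state and resume where it stopped */
    vacrel->phvstate->myscanstate = wstate;
    (void) do_lazy_scan_heap(vacrel);
}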
# Discussions
I'm fairly confident in the overall design of this feature, but there
are some points regarding the implementation that we need to discuss.
In the patch, I extended vacuumparallel.c to support parallel table
scan (and vacuum in the future). To do so, I had to add some table AM
callbacks for DSM size estimation, DSM initialization, the actual table
scan, etc. We need to verify that these APIs are appropriate.
Specifically, if we want to support both parallel heap scan and
parallel heap vacuum, do we want to add separate callbacks for them?
That could be overkill, since such a two-pass vacuum strategy is
specific to the heap AM.
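For reference, these are the callback signatures the PoC patch adds
and wires into heapam_methods (their exact shape is one of the things
to discuss):

int     heap_parallel_vacuum_compute_workers(Relation rel, int nrequested);
void    heap_parallel_vacuum_estimate(Relation rel, ParallelContext *pcxt,
                                      int nworkers, void *state);
void    heap_parallel_vacuum_initialize(Relation rel, ParallelContext *pcxt,
                                        int nworkers, void *state);
void    heap_parallel_vacuum_scan_worker(Relation rel, ParallelVacuumState *pvs,
                                         ParallelWorkerContext *pwcxt);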
As another implementation idea, we might want to implement parallel
heap scan/vacuum in vacuumlazy.c while minimizing the changes to
vacuumparallel.c. That way, we would not need to add table AM
callbacks. However, we would end up with duplicated code related to
parallel operation in vacuum, such as vacuum delays.
Also, we might need to add some functions to share GlobalVisState
among parallel workers, since GlobalVisState is a private struct.
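In the PoC this is done by embedding a plain copy of the struct in the
shared area, which is why the struct definition has to be exposed:

/* leader, while initializing the DSM */
shared->vistest = *vacrel->vistest;

/* worker, after attaching to the DSM */
vacrel.vistest = &(shared->vistest);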
Other points that I'm somewhat uncomfortable with, or that need to be
discussed, remain in the code as XXX comments.
# Benchmark results
* Test-1: parallel heap scan on the table without indexes
I created a 20GB table, made garbage on it, and ran vacuum while
changing the parallel degree:
create unlogged table test (a int) with (autovacuum_enabled = off);
insert into test select generate_series(1, 600000000); --- 20GB table
delete from test where a % 5 = 0;
vacuum (verbose, parallel 0) test;
Here are the results (total time; with no indexes, the heap scan
accounts for almost all of it):
PARALLEL 0: 21.99 s (single process)
PARALLEL 1: 11.39 s
PARALLEL 2: 8.36 s
PARALLEL 3: 6.14 s
PARALLEL 4: 5.08 s
* Test-2: parallel heap scan on the table with one index
I used a table similar to the one in test 1, but created one btree index on it:
create unlogged table test (a int) with (autovacuum_enabled = off);
insert into test select generate_series(1, 600000000); --- 20GB table
create index on test (a);
delete from test where a % 5 = 0;
vacuum (verbose, parallel 0) test;
I've measured the total execution time as well as the time of each
vacuum phase (from left: heap scan time, index vacuum time, and heap
vacuum time):
PARALLEL 0: 45.11 s (21.89, 16.74, 6.48)
PARALLEL 1: 42.13 s (12.75, 22.04, 7.23)
PARALLEL 2: 39.27 s (8.93, 22.78, 7.45)
PARALLEL 3: 36.53 s (6.76, 22.00, 7.65)
PARALLEL 4: 35.84 s (5.85, 22.04, 7.83)
Overall, I can see that the parallel heap scan in lazy vacuum has
decent scalability: in both test-1 and test-2, the heap scan got ~4x
faster with 4 parallel workers. On the other hand, when it comes to the
total vacuum execution time, I could not see much performance
improvement in test-2 (45.11 s vs. 35.84 s). Comparing PARALLEL 0 and
PARALLEL 1 in test-2, the heap scan got faster (21.89 vs. 12.75)
whereas the index vacuum got slower (16.74 vs. 22.04), and the heap
scan in test-2 was not as fast as in test-1 with 1 parallel worker
(12.75 vs. 11.39).
I think the reason is that the shared TidStore is not very scalable,
since we have a single lock on it. In test-1 we never use the shared
TidStore, because all dead tuples are removed during heap pruning, so
the overall scalability was better than in test-2. In the PARALLEL 0
case in test-2 we use the local TidStore, and from parallel degree 1
onward in test-2 we use the shared TidStore, which the parallel workers
update concurrently. Also, I guess that the lookup performance of the
local TidStore is better than the shared TidStore's because of the
differences between a bump context and a DSA area. I think this
difference contributed to index vacuuming getting slower (16.74 vs.
22.04).
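To make the contention point concrete: in the patch, every dead-TID
insertion is serialized through a single exclusive lock on the shared
TidStore (see dead_items_add()):

if (ParallelHeapVacuumIsActive(vacrel))
    TidStoreLockExclusive(dead_items);

TidStoreSetBlockOffsets(dead_items, blkno, offsets, num_offsets);
vacrel->dead_items_info->num_items += num_offsets;

if (ParallelHeapVacuumIsActive(vacrel))
    TidStoreUnlock(dead_items);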
There are two obvious ideas for improving the overall vacuum execution
time: (1) improve the shared TidStore's scalability and (2) support
parallel heap vacuum. For (1), several ideas are proposed by the ART
authors[1]. I've not tried these ideas, but they might be applicable to
our ART implementation. But I prefer to start with (2) since it would
be easier. Feedback is very welcome.
Regards,
[1]: https://db.in.tum.de/~leis/papers/artsync.pdf
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
parallel_heap_vacuum_scan.patch (application/x-patch)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 6f8b1b7929..cf8c6614cd 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2630,6 +2630,12 @@ static const TableAmRoutine heapam_methods = {
.relation_copy_data = heapam_relation_copy_data,
.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
.relation_vacuum = heap_vacuum_rel,
+
+ .parallel_vacuum_compute_workers = heap_parallel_vacuum_compute_workers,
+ .parallel_vacuum_estimate = heap_parallel_vacuum_estimate,
+ .parallel_vacuum_initialize = heap_parallel_vacuum_initialize,
+ .parallel_vacuum_scan_worker = heap_parallel_vacuum_scan_worker,
+
.scan_analyze_next_block = heapam_scan_analyze_next_block,
.scan_analyze_next_tuple = heapam_scan_analyze_next_tuple,
.index_build_range_scan = heapam_index_build_range_scan,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 3f88cf1e8e..4ccf15ffe3 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -49,6 +49,7 @@
#include "common/int.h"
#include "executor/instrument.h"
#include "miscadmin.h"
+#include "optimizer/paths.h"
#include "pgstat.h"
#include "portability/instr_time.h"
#include "postmaster/autovacuum.h"
@@ -117,10 +118,22 @@
#define PREFETCH_SIZE ((BlockNumber) 32)
/*
- * Macro to check if we are in a parallel vacuum. If true, we are in the
- * parallel mode and the DSM segment is initialized.
+ * DSM keys for heap parallel vacuum scan. Unlike other parallel execution code,
+ * we don't need to worry about DSM keys conflicting with plan_node_id, but we
+ * need to avoid conflicting with DSM keys used in vacuumparallel.c.
+ */
+#define LV_PARALLEL_SCAN_SHARED 0xFFFF0001
+#define LV_PARALLEL_SCAN_DESC 0xFFFF0002
+#define LV_PARALLEL_SCAN_DESC_WORKER 0xFFFF0003
+
+/*
+ * Macro to check if we are in a parallel vacuum. If ParallelVacuumIsActive() is
+ * true, we are in the parallel mode, meaning that we do either parallel index
+ * vacuuming or parallel table vacuuming, or both. If ParallelHeapVacuumIsActive()
+ * is true, we do at least parallel table vacuuming.
*/
#define ParallelVacuumIsActive(vacrel) ((vacrel)->pvs != NULL)
+#define ParallelHeapVacuumIsActive(vacrel) ((vacrel)->phvstate != NULL)
/* Phases of vacuum during which we report error context. */
typedef enum
@@ -133,6 +146,80 @@ typedef enum
VACUUM_ERRCB_PHASE_TRUNCATE,
} VacErrPhase;
+/*
+ * Relation statistics that are collected during heap scanning and need to be
+ * shared among parallel vacuum workers.
+ */
+typedef struct LVRelCounters
+{
+ BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
+ BlockNumber removed_pages; /* # pages removed by relation truncation */
+ BlockNumber frozen_pages; /* # pages with newly frozen tuples */
+ BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
+ BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
+ BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
+
+ /* Counters that follow are only for scanned_pages */
+ int64 tuples_deleted; /* # deleted from table */
+ int64 tuples_frozen; /* # newly frozen */
+ int64 lpdead_items; /* # deleted from indexes */
+ int64 live_tuples; /* # live tuples remaining */
+ int64 recently_dead_tuples; /* # dead, but not yet removable */
+ int64 missed_dead_tuples; /* # removable, but not removed */
+
+ /* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid. */
+ TransactionId NewRelfrozenXid;
+ MultiXactId NewRelminMxid;
+ bool skippedallvis;
+} LVRelCounters;
+
+/*
+ * Struct for information that needs to be shared among parallel vacuum workers
+ */
+typedef struct PHVShared
+{
+ bool aggressive;
+ bool skipwithvm;
+
+ /* The initial values shared by the leader process */
+ TransactionId NewRelfrozenXid;
+ MultiXactId NewRelminMxid;
+ bool skippedallvis;
+
+ /* VACUUM operation's cutoffs for freezing and pruning */
+ struct VacuumCutoffs cutoffs;
+ GlobalVisState vistest;
+
+ LVRelCounters worker_relcnts[FLEXIBLE_ARRAY_MEMBER];
+} PHVShared;
+#define SizeOfPHVShared (offsetof(PHVShared, worker_relcnts))
+
+/* Per-worker scan state */
+typedef struct PHVScanWorkerState
+{
+ ParallelBlockTableScanWorkerData state;
+ bool maybe_have_blocks;
+} PHVScanWorkerState;
+
+/* Struct for parallel heap vacuum */
+typedef struct PHVState
+{
+ /* Parallel scan description shared among parallel workers */
+ ParallelBlockTableScanDesc pscandesc;
+
+ /* Shared information */
+ PHVShared *shared;
+
+ /* Per-worker scan state */
+ PHVScanWorkerState *myscanstate;
+
+ /* Points to all per-worker scan state array */
+ PHVScanWorkerState *scanstates;
+
+ /* The number of workers launched for parallel heap vacuum */
+ int nworkers_launched;
+} PHVState;
+
typedef struct LVRelState
{
/* Target heap relation and its indexes */
@@ -144,6 +231,12 @@ typedef struct LVRelState
BufferAccessStrategy bstrategy;
ParallelVacuumState *pvs;
+ /* Parallel heap vacuum state and sizes for each struct */
+ PHVState *phvstate;
+ Size pscan_len;
+ Size shared_len;
+ Size pscanwork_len;
+
/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
bool aggressive;
/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
@@ -159,10 +252,6 @@ typedef struct LVRelState
/* VACUUM operation's cutoffs for freezing and pruning */
struct VacuumCutoffs cutoffs;
GlobalVisState *vistest;
- /* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
- TransactionId NewRelfrozenXid;
- MultiXactId NewRelminMxid;
- bool skippedallvis;
/* Error reporting state */
char *dbname;
@@ -188,12 +277,10 @@ typedef struct LVRelState
VacDeadItemsInfo *dead_items_info;
BlockNumber rel_pages; /* total number of pages */
- BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
- BlockNumber removed_pages; /* # pages removed by relation truncation */
- BlockNumber frozen_pages; /* # pages with newly frozen tuples */
- BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
- BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
- BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
+ BlockNumber next_fsm_block_to_vacuum;
+
+ /* Block and tuple counters for the relation */
+ LVRelCounters *counters;
/* Statistics output by us, for table */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -203,13 +290,6 @@ typedef struct LVRelState
/* Instrumentation counters */
int num_index_scans;
- /* Counters that follow are only for scanned_pages */
- int64 tuples_deleted; /* # deleted from table */
- int64 tuples_frozen; /* # newly frozen */
- int64 lpdead_items; /* # deleted from indexes */
- int64 live_tuples; /* # live tuples remaining */
- int64 recently_dead_tuples; /* # dead, but not yet removable */
- int64 missed_dead_tuples; /* # removable, but not removed */
/* State maintained by heap_vac_scan_next_block() */
BlockNumber current_block; /* last block returned */
@@ -229,6 +309,7 @@ typedef struct LVSavedErrInfo
/* non-export function prototypes */
static void lazy_scan_heap(LVRelState *vacrel);
+static bool do_lazy_scan_heap(LVRelState *vacrel);
static bool heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
bool *all_visible_according_to_vm);
static void find_next_unskippable_block(LVRelState *vacrel, bool *skipsallvis);
@@ -271,6 +352,12 @@ static void dead_items_cleanup(LVRelState *vacrel);
static bool heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
TransactionId *visibility_cutoff_xid, bool *all_frozen);
static void update_relstats_all_indexes(LVRelState *vacrel);
+
+
+static void do_parallel_lazy_scan_heap(LVRelState *vacrel);
+static void parallel_heap_vacuum_gather_scan_stats(LVRelState *vacrel);
+static void parallel_heap_complete_unfinished_scan(LVRelState *vacrel);
+
static void vacuum_error_callback(void *arg);
static void update_vacuum_error_info(LVRelState *vacrel,
LVSavedErrInfo *saved_vacrel,
@@ -296,6 +383,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
BufferAccessStrategy bstrategy)
{
LVRelState *vacrel;
+ LVRelCounters *counters;
bool verbose,
instrument,
skipwithvm,
@@ -406,14 +494,28 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
Assert(params->index_cleanup == VACOPTVALUE_AUTO);
}
+ vacrel->next_fsm_block_to_vacuum = 0;
+
/* Initialize page counters explicitly (be tidy) */
- vacrel->scanned_pages = 0;
- vacrel->removed_pages = 0;
- vacrel->frozen_pages = 0;
- vacrel->lpdead_item_pages = 0;
- vacrel->missed_dead_pages = 0;
- vacrel->nonempty_pages = 0;
- /* dead_items_alloc allocates vacrel->dead_items later on */
+ counters = palloc(sizeof(LVRelCounters));
+ counters->scanned_pages = 0;
+ counters->removed_pages = 0;
+ counters->frozen_pages = 0;
+ counters->lpdead_item_pages = 0;
+ counters->missed_dead_pages = 0;
+ counters->nonempty_pages = 0;
+
+ /* Initialize remaining counters (be tidy) */
+ counters->tuples_deleted = 0;
+ counters->tuples_frozen = 0;
+ counters->lpdead_items = 0;
+ counters->live_tuples = 0;
+ counters->recently_dead_tuples = 0;
+ counters->missed_dead_tuples = 0;
+
+ vacrel->counters = counters;
+
+ vacrel->num_index_scans = 0;
/* Allocate/initialize output statistics state */
vacrel->new_rel_tuples = 0;
@@ -421,14 +523,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->indstats = (IndexBulkDeleteResult **)
palloc0(vacrel->nindexes * sizeof(IndexBulkDeleteResult *));
- /* Initialize remaining counters (be tidy) */
- vacrel->num_index_scans = 0;
- vacrel->tuples_deleted = 0;
- vacrel->tuples_frozen = 0;
- vacrel->lpdead_items = 0;
- vacrel->live_tuples = 0;
- vacrel->recently_dead_tuples = 0;
- vacrel->missed_dead_tuples = 0;
+ /* dead_items_alloc allocates vacrel->dead_items later on */
/*
* Get cutoffs that determine which deleted tuples are considered DEAD,
@@ -450,9 +545,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
vacrel->vistest = GlobalVisTestFor(rel);
/* Initialize state used to track oldest extant XID/MXID */
- vacrel->NewRelfrozenXid = vacrel->cutoffs.OldestXmin;
- vacrel->NewRelminMxid = vacrel->cutoffs.OldestMxact;
- vacrel->skippedallvis = false;
+ vacrel->counters->NewRelfrozenXid = vacrel->cutoffs.OldestXmin;
+ vacrel->counters->NewRelminMxid = vacrel->cutoffs.OldestMxact;
+ vacrel->counters->skippedallvis = false;
skipwithvm = true;
if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
{
@@ -533,15 +628,15 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* value >= FreezeLimit, and relminmxid to a value >= MultiXactCutoff.
* Non-aggressive VACUUMs may advance them by any amount, or not at all.
*/
- Assert(vacrel->NewRelfrozenXid == vacrel->cutoffs.OldestXmin ||
+ Assert(vacrel->counters->NewRelfrozenXid == vacrel->cutoffs.OldestXmin ||
TransactionIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.FreezeLimit :
vacrel->cutoffs.relfrozenxid,
- vacrel->NewRelfrozenXid));
- Assert(vacrel->NewRelminMxid == vacrel->cutoffs.OldestMxact ||
+ vacrel->counters->NewRelfrozenXid));
+ Assert(vacrel->counters->NewRelminMxid == vacrel->cutoffs.OldestMxact ||
MultiXactIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.MultiXactCutoff :
vacrel->cutoffs.relminmxid,
- vacrel->NewRelminMxid));
- if (vacrel->skippedallvis)
+ vacrel->counters->NewRelminMxid));
+ if (vacrel->counters->skippedallvis)
{
/*
* Must keep original relfrozenxid in a non-aggressive VACUUM that
@@ -549,8 +644,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* values will have missed unfrozen XIDs from the pages we skipped.
*/
Assert(!vacrel->aggressive);
- vacrel->NewRelfrozenXid = InvalidTransactionId;
- vacrel->NewRelminMxid = InvalidMultiXactId;
+ vacrel->counters->NewRelfrozenXid = InvalidTransactionId;
+ vacrel->counters->NewRelminMxid = InvalidMultiXactId;
}
/*
@@ -571,7 +666,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
*/
vac_update_relstats(rel, new_rel_pages, vacrel->new_live_tuples,
new_rel_allvisible, vacrel->nindexes > 0,
- vacrel->NewRelfrozenXid, vacrel->NewRelminMxid,
+ vacrel->counters->NewRelfrozenXid, vacrel->counters->NewRelminMxid,
&frozenxid_updated, &minmulti_updated, false);
/*
@@ -587,8 +682,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(rel),
rel->rd_rel->relisshared,
Max(vacrel->new_live_tuples, 0),
- vacrel->recently_dead_tuples +
- vacrel->missed_dead_tuples);
+ vacrel->counters->recently_dead_tuples +
+ vacrel->counters->missed_dead_tuples);
pgstat_progress_end_command();
if (instrument)
@@ -651,21 +746,21 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->relname,
vacrel->num_index_scans);
appendStringInfo(&buf, _("pages: %u removed, %u remain, %u scanned (%.2f%% of total)\n"),
- vacrel->removed_pages,
+ vacrel->counters->removed_pages,
new_rel_pages,
- vacrel->scanned_pages,
+ vacrel->counters->scanned_pages,
orig_rel_pages == 0 ? 100.0 :
- 100.0 * vacrel->scanned_pages / orig_rel_pages);
+ 100.0 * vacrel->counters->scanned_pages / orig_rel_pages);
appendStringInfo(&buf,
_("tuples: %lld removed, %lld remain, %lld are dead but not yet removable\n"),
- (long long) vacrel->tuples_deleted,
+ (long long) vacrel->counters->tuples_deleted,
(long long) vacrel->new_rel_tuples,
- (long long) vacrel->recently_dead_tuples);
- if (vacrel->missed_dead_tuples > 0)
+ (long long) vacrel->counters->recently_dead_tuples);
+ if (vacrel->counters->missed_dead_tuples > 0)
appendStringInfo(&buf,
_("tuples missed: %lld dead from %u pages not removed due to cleanup lock contention\n"),
- (long long) vacrel->missed_dead_tuples,
- vacrel->missed_dead_pages);
+ (long long) vacrel->counters->missed_dead_tuples,
+ vacrel->counters->missed_dead_pages);
diff = (int32) (ReadNextTransactionId() -
vacrel->cutoffs.OldestXmin);
appendStringInfo(&buf,
@@ -673,25 +768,25 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->cutoffs.OldestXmin, diff);
if (frozenxid_updated)
{
- diff = (int32) (vacrel->NewRelfrozenXid -
+ diff = (int32) (vacrel->counters->NewRelfrozenXid -
vacrel->cutoffs.relfrozenxid);
appendStringInfo(&buf,
_("new relfrozenxid: %u, which is %d XIDs ahead of previous value\n"),
- vacrel->NewRelfrozenXid, diff);
+ vacrel->counters->NewRelfrozenXid, diff);
}
if (minmulti_updated)
{
- diff = (int32) (vacrel->NewRelminMxid -
+ diff = (int32) (vacrel->counters->NewRelminMxid -
vacrel->cutoffs.relminmxid);
appendStringInfo(&buf,
_("new relminmxid: %u, which is %d MXIDs ahead of previous value\n"),
- vacrel->NewRelminMxid, diff);
+ vacrel->counters->NewRelminMxid, diff);
}
appendStringInfo(&buf, _("frozen: %u pages from table (%.2f%% of total) had %lld tuples frozen\n"),
- vacrel->frozen_pages,
+ vacrel->counters->frozen_pages,
orig_rel_pages == 0 ? 100.0 :
- 100.0 * vacrel->frozen_pages / orig_rel_pages,
- (long long) vacrel->tuples_frozen);
+ 100.0 * vacrel->counters->frozen_pages / orig_rel_pages,
+ (long long) vacrel->counters->tuples_frozen);
if (vacrel->do_index_vacuuming)
{
if (vacrel->nindexes == 0 || vacrel->num_index_scans == 0)
@@ -711,10 +806,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
msgfmt = _("%u pages from table (%.2f%% of total) have %lld dead item identifiers\n");
}
appendStringInfo(&buf, msgfmt,
- vacrel->lpdead_item_pages,
+ vacrel->counters->lpdead_item_pages,
orig_rel_pages == 0 ? 100.0 :
- 100.0 * vacrel->lpdead_item_pages / orig_rel_pages,
- (long long) vacrel->lpdead_items);
+ 100.0 * vacrel->counters->lpdead_item_pages / orig_rel_pages,
+ (long long) vacrel->counters->lpdead_items);
for (int i = 0; i < vacrel->nindexes; i++)
{
IndexBulkDeleteResult *istat = vacrel->indstats[i];
@@ -815,14 +910,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
static void
lazy_scan_heap(LVRelState *vacrel)
{
- BlockNumber rel_pages = vacrel->rel_pages,
- blkno,
- next_fsm_block_to_vacuum = 0;
- bool all_visible_according_to_vm;
-
- TidStore *dead_items = vacrel->dead_items;
+ BlockNumber rel_pages = vacrel->rel_pages;
VacDeadItemsInfo *dead_items_info = vacrel->dead_items_info;
- Buffer vmbuffer = InvalidBuffer;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -842,14 +931,78 @@ lazy_scan_heap(LVRelState *vacrel)
vacrel->next_unskippable_allvis = false;
vacrel->next_unskippable_vmbuffer = InvalidBuffer;
+ if (ParallelHeapVacuumIsActive(vacrel))
+ do_parallel_lazy_scan_heap(vacrel);
+ else
+ do_lazy_scan_heap(vacrel);
+
+ vacrel->blkno = InvalidBlockNumber;
+
+ /* report that everything is now scanned */
+ pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, rel_pages);
+
+ /* now we can compute the new value for pg_class.reltuples */
+ vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
+ vacrel->counters->scanned_pages,
+ vacrel->counters->live_tuples);
+
+ /*
+ * Also compute the total number of surviving heap entries. In the
+ * (unlikely) scenario that new_live_tuples is -1, take it as zero.
+ */
+ vacrel->new_rel_tuples =
+ Max(vacrel->new_live_tuples, 0) + vacrel->counters->recently_dead_tuples +
+ vacrel->counters->missed_dead_tuples;
+
+ /*
+ * Do index vacuuming (call each index's ambulkdelete routine), then do
+ * related heap vacuuming
+ */
+ if (dead_items_info->num_items > 0)
+ lazy_vacuum(vacrel);
+
+ /*
+ * Vacuum the remainder of the Free Space Map. We must do this whether or
+ * not there were indexes, and whether or not we bypassed index vacuuming.
+ */
+ if (rel_pages > vacrel->next_fsm_block_to_vacuum)
+ FreeSpaceMapVacuumRange(vacrel->rel, vacrel->next_fsm_block_to_vacuum,
+ rel_pages);
+
+ /* report all blocks vacuumed */
+ pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, rel_pages);
+
+ /* Do final index cleanup (call each index's amvacuumcleanup routine) */
+ if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
+ lazy_cleanup_all_indexes(vacrel);
+}
+
+/*
+ * Workhorse for lazy_scan_heap().
+ *
+ * Return true if we processed all blocks, or false if we exited before
+ * completing the heap scan because the space for dead item TIDs is full. In the
+ * serial heap scan case, this function always returns true. In the parallel heap
+ * scan case, this function is called by both worker processes and the leader
+ * process, and could return false.
+ */
+static bool
+do_lazy_scan_heap(LVRelState *vacrel)
+{
+ bool all_visible_according_to_vm;
+ TidStore *dead_items = vacrel->dead_items;
+ VacDeadItemsInfo *dead_items_info = vacrel->dead_items_info;
+ BlockNumber blkno;
+ Buffer vmbuffer = InvalidBuffer;
+ bool scan_done = true;
+
while (heap_vac_scan_next_block(vacrel, &blkno, &all_visible_according_to_vm))
{
- Buffer buf;
- Page page;
- bool has_lpdead_items;
- bool got_cleanup_lock = false;
+ Buffer buf;
+ Page page;
+ bool has_lpdead_items;
+ bool got_cleanup_lock = false;
- vacrel->scanned_pages++;
+ vacrel->counters->scanned_pages++;
/* Report as block scanned, update error traceback information */
pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
@@ -867,46 +1020,10 @@ lazy_scan_heap(LVRelState *vacrel)
* one-pass strategy, and the two-pass strategy with the index_cleanup
* param set to 'off'.
*/
- if (vacrel->scanned_pages % FAILSAFE_EVERY_PAGES == 0)
+ if (!IsParallelWorker() &&
+ vacrel->counters->scanned_pages % FAILSAFE_EVERY_PAGES == 0)
lazy_check_wraparound_failsafe(vacrel);
- /*
- * Consider if we definitely have enough space to process TIDs on page
- * already. If we are close to overrunning the available space for
- * dead_items TIDs, pause and do a cycle of vacuuming before we tackle
- * this page.
- */
- if (TidStoreMemoryUsage(dead_items) > dead_items_info->max_bytes)
- {
- /*
- * Before beginning index vacuuming, we release any pin we may
- * hold on the visibility map page. This isn't necessary for
- * correctness, but we do it anyway to avoid holding the pin
- * across a lengthy, unrelated operation.
- */
- if (BufferIsValid(vmbuffer))
- {
- ReleaseBuffer(vmbuffer);
- vmbuffer = InvalidBuffer;
- }
-
- /* Perform a round of index and heap vacuuming */
- vacrel->consider_bypass_optimization = false;
- lazy_vacuum(vacrel);
-
- /*
- * Vacuum the Free Space Map to make newly-freed space visible on
- * upper-level FSM pages. Note we have not yet processed blkno.
- */
- FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
- blkno);
- next_fsm_block_to_vacuum = blkno;
-
- /* Report that we are once again scanning the heap */
- pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
- PROGRESS_VACUUM_PHASE_SCAN_HEAP);
- }
-
/*
* Pin the visibility map page in case we need to mark the page
* all-visible. In most cases this will be very cheap, because we'll
@@ -994,10 +1111,14 @@ lazy_scan_heap(LVRelState *vacrel)
* also be no opportunity to update the FSM later, because we'll never
* revisit this page. Since updating the FSM is desirable but not
* absolutely required, that's OK.
+ *
+ * XXX: in parallel heap scan, some blocks before blkno might not have
+ * been processed yet. Is it worth vacuuming the FSM?
*/
- if (vacrel->nindexes == 0
- || !vacrel->do_index_vacuuming
- || !has_lpdead_items)
+ if (!IsParallelWorker() &&
+ (vacrel->nindexes == 0
+ || !vacrel->do_index_vacuuming
+ || !has_lpdead_items))
{
Size freespace = PageGetHeapFreeSpace(page);
@@ -1011,57 +1132,144 @@ lazy_scan_heap(LVRelState *vacrel)
* held the cleanup lock and lazy_scan_prune() was called.
*/
if (got_cleanup_lock && vacrel->nindexes == 0 && has_lpdead_items &&
- blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
+ blkno - vacrel->next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
{
- FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
+ FreeSpaceMapVacuumRange(vacrel->rel, vacrel->next_fsm_block_to_vacuum,
blkno);
- next_fsm_block_to_vacuum = blkno;
+ vacrel->next_fsm_block_to_vacuum = blkno;
}
}
else
UnlockReleaseBuffer(buf);
+
+ /*
+ * Consider if we definitely have enough space to process TIDs on page
+ * already. If we are close to overrunning the available space for
+ * dead_items TIDs, pause and do a cycle of vacuuming before we tackle
+ * this page.
+ */
+ if (TidStoreMemoryUsage(dead_items) > dead_items_info->max_bytes)
+ {
+ /*
+ * Before beginning index vacuuming, we release any pin we may
+ * hold on the visibility map page. This isn't necessary for
+ * correctness, but we do it anyway to avoid holding the pin
+ * across a lengthy, unrelated operation.
+ */
+ if (BufferIsValid(vmbuffer))
+ {
+ ReleaseBuffer(vmbuffer);
+ vmbuffer = InvalidBuffer;
+ }
+
+ if (ParallelHeapVacuumIsActive(vacrel))
+ {
+ /*
+ * In parallel heap vacuum case, both the leader process and the
+ * worker processes have to exit without invoking index and heap
+ * vacuuming. The leader process will wait for all workers to
+ * finish and perform index and heap vacuuming.
+ */
+ scan_done = false;
+ break;
+ }
+
+ /* Perform a round of index and heap vacuuming */
+ vacrel->consider_bypass_optimization = false;
+ lazy_vacuum(vacrel);
+
+ /*
+ * Vacuum the Free Space Map to make newly-freed space visible on
+ * upper-level FSM pages.
+ *
+ * XXX: in parallel heap scan, some blocks before blkno might not have
+ * been processed yet. Is it worth vacuuming the FSM?
+ */
+ FreeSpaceMapVacuumRange(vacrel->rel, vacrel->next_fsm_block_to_vacuum,
+ blkno + 1);
+ vacrel->next_fsm_block_to_vacuum = blkno;
+
+ /* Report that we are once again scanning the heap */
+ pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+ PROGRESS_VACUUM_PHASE_SCAN_HEAP);
+
+ continue;
+ }
}
- vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
ReleaseBuffer(vmbuffer);
- /* report that everything is now scanned */
- pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
+ return scan_done;
+}
- /* now we can compute the new value for pg_class.reltuples */
- vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
- vacrel->scanned_pages,
- vacrel->live_tuples);
+/*
+ * A parallel scan variant of heap_vac_scan_next_block.
+ *
+ * In parallel vacuum scan, we don't use the SKIP_PAGES_THRESHOLD optimization.
+ */
+static bool
+heap_vac_scan_next_block_parallel(LVRelState *vacrel, BlockNumber *blkno,
+ bool *all_visible_according_to_vm)
+{
+ PHVState *phvstate = vacrel->phvstate;
+ BlockNumber next_block;
+ Buffer vmbuffer = InvalidBuffer;
+ uint8 mapbits = 0;
- /*
- * Also compute the total number of surviving heap entries. In the
- * (unlikely) scenario that new_live_tuples is -1, take it as zero.
- */
- vacrel->new_rel_tuples =
- Max(vacrel->new_live_tuples, 0) + vacrel->recently_dead_tuples +
- vacrel->missed_dead_tuples;
+ Assert(ParallelHeapVacuumIsActive(vacrel));
- /*
- * Do index vacuuming (call each index's ambulkdelete routine), then do
- * related heap vacuuming
- */
- if (dead_items_info->num_items > 0)
- lazy_vacuum(vacrel);
+ for (;;)
+ {
+ next_block = table_block_parallelscan_nextpage(vacrel->rel,
+ &(phvstate->myscanstate->state),
+ phvstate->pscandesc);
- /*
- * Vacuum the remainder of the Free Space Map. We must do this whether or
- * not there were indexes, and whether or not we bypassed index vacuuming.
- */
- if (blkno > next_fsm_block_to_vacuum)
- FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno);
+ /* Have we reached the end of the table? */
+ if (!BlockNumberIsValid(next_block) || next_block >= vacrel->rel_pages)
+ {
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
- /* report all blocks vacuumed */
- pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
+ *blkno = vacrel->rel_pages;
+ return false;
+ }
- /* Do final index cleanup (call each index's amvacuumcleanup routine) */
- if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
- lazy_cleanup_all_indexes(vacrel);
+ /* We always treat the last block as unsafe to skip */
+ if (next_block == vacrel->rel_pages - 1)
+ break;
+
+ mapbits = visibilitymap_get_status(vacrel->rel, next_block, &vmbuffer);
+
+ /* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
+ if (!vacrel->skipwithvm)
+ break;
+
+ /*
+ * Aggressive VACUUM caller can't skip pages just because they are
+ * all-visible.
+ */
+ if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
+ {
+
+ if (vacrel->aggressive)
+ break;
+
+ /*
+ * All-visible block is safe to skip in non-aggressive case. But
+ * remember that the final range contains such a block for later.
+ */
+ vacrel->counters->skippedallvis = true;
+ }
+ }
+
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
+
+ *blkno = next_block;
+ *all_visible_according_to_vm = (mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0;
+
+ return true;
}
/*
@@ -1088,6 +1296,9 @@ heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
{
BlockNumber next_block;
+ if (ParallelHeapVacuumIsActive(vacrel))
+ return heap_vac_scan_next_block_parallel(vacrel, blkno, all_visible_according_to_vm);
+
/* relies on InvalidBlockNumber + 1 overflowing to 0 on first call */
next_block = vacrel->current_block + 1;
@@ -1137,7 +1348,7 @@ heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
{
next_block = vacrel->next_unskippable_block;
if (skipsallvis)
- vacrel->skippedallvis = true;
+ vacrel->counters->skippedallvis = true;
}
}
@@ -1210,7 +1421,7 @@ find_next_unskippable_block(LVRelState *vacrel, bool *skipsallvis)
/*
* Caller must scan the last page to determine whether it has tuples
- * (caller must have the opportunity to set vacrel->nonempty_pages).
+ * (caller must have the opportunity to set vacrel->counters->nonempty_pages).
* This rule avoids having lazy_truncate_heap() take access-exclusive
* lock on rel to attempt a truncation that fails anyway, just because
* there are tuples on the last page (it is likely that there will be
@@ -1439,10 +1650,10 @@ lazy_scan_prune(LVRelState *vacrel,
heap_page_prune_and_freeze(rel, buf, vacrel->vistest, prune_options,
&vacrel->cutoffs, &presult, PRUNE_VACUUM_SCAN,
&vacrel->offnum,
- &vacrel->NewRelfrozenXid, &vacrel->NewRelminMxid);
+ &vacrel->counters->NewRelfrozenXid, &vacrel->counters->NewRelminMxid);
Assert(MultiXactIdIsValid(vacrel->NewRelminMxid));
- Assert(TransactionIdIsValid(vacrel->NewRelfrozenXid));
+ Assert(TransactionIdIsValid(vacrel->counters->NewRelfrozenXid));
if (presult.nfrozen > 0)
{
@@ -1451,7 +1662,7 @@ lazy_scan_prune(LVRelState *vacrel,
* nfrozen == 0, since it only counts pages with newly frozen tuples
* (don't confuse that with pages newly set all-frozen in VM).
*/
- vacrel->frozen_pages++;
+ vacrel->counters->frozen_pages++;
}
/*
@@ -1486,7 +1697,7 @@ lazy_scan_prune(LVRelState *vacrel,
*/
if (presult.lpdead_items > 0)
{
- vacrel->lpdead_item_pages++;
+ vacrel->counters->lpdead_item_pages++;
/*
* deadoffsets are collected incrementally in
@@ -1501,15 +1712,15 @@ lazy_scan_prune(LVRelState *vacrel,
}
/* Finally, add page-local counts to whole-VACUUM counts */
- vacrel->tuples_deleted += presult.ndeleted;
- vacrel->tuples_frozen += presult.nfrozen;
- vacrel->lpdead_items += presult.lpdead_items;
- vacrel->live_tuples += presult.live_tuples;
- vacrel->recently_dead_tuples += presult.recently_dead_tuples;
+ vacrel->counters->tuples_deleted += presult.ndeleted;
+ vacrel->counters->tuples_frozen += presult.nfrozen;
+ vacrel->counters->lpdead_items += presult.lpdead_items;
+ vacrel->counters->live_tuples += presult.live_tuples;
+ vacrel->counters->recently_dead_tuples += presult.recently_dead_tuples;
/* Can't truncate this page */
if (presult.hastup)
- vacrel->nonempty_pages = blkno + 1;
+ vacrel->counters->nonempty_pages = blkno + 1;
/* Did we find LP_DEAD items? */
*has_lpdead_items = (presult.lpdead_items > 0);
@@ -1659,8 +1870,8 @@ lazy_scan_noprune(LVRelState *vacrel,
missed_dead_tuples;
bool hastup;
HeapTupleHeader tupleheader;
- TransactionId NoFreezePageRelfrozenXid = vacrel->NewRelfrozenXid;
- MultiXactId NoFreezePageRelminMxid = vacrel->NewRelminMxid;
+ TransactionId NoFreezePageRelfrozenXid = vacrel->counters->NewRelfrozenXid;
+ MultiXactId NoFreezePageRelminMxid = vacrel->counters->NewRelminMxid;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1787,8 +1998,8 @@ lazy_scan_noprune(LVRelState *vacrel,
* this particular page until the next VACUUM. Remember its details now.
* (lazy_scan_prune expects a clean slate, so we have to do this last.)
*/
- vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
- vacrel->NewRelminMxid = NoFreezePageRelminMxid;
+ vacrel->counters->NewRelfrozenXid = NoFreezePageRelfrozenXid;
+ vacrel->counters->NewRelminMxid = NoFreezePageRelminMxid;
/* Save any LP_DEAD items found on the page in dead_items */
if (vacrel->nindexes == 0)
@@ -1815,25 +2026,25 @@ lazy_scan_noprune(LVRelState *vacrel,
* indexes will be deleted during index vacuuming (and then marked
* LP_UNUSED in the heap)
*/
- vacrel->lpdead_item_pages++;
+ vacrel->counters->lpdead_item_pages++;
dead_items_add(vacrel, blkno, deadoffsets, lpdead_items);
- vacrel->lpdead_items += lpdead_items;
+ vacrel->counters->lpdead_items += lpdead_items;
}
/*
* Finally, add relevant page-local counts to whole-VACUUM counts
*/
- vacrel->live_tuples += live_tuples;
- vacrel->recently_dead_tuples += recently_dead_tuples;
- vacrel->missed_dead_tuples += missed_dead_tuples;
+ vacrel->counters->live_tuples += live_tuples;
+ vacrel->counters->recently_dead_tuples += recently_dead_tuples;
+ vacrel->counters->missed_dead_tuples += missed_dead_tuples;
if (missed_dead_tuples > 0)
- vacrel->missed_dead_pages++;
+ vacrel->counters->missed_dead_pages++;
/* Can't truncate this page */
if (hastup)
- vacrel->nonempty_pages = blkno + 1;
+ vacrel->counters->nonempty_pages = blkno + 1;
/* Did we find LP_DEAD items? */
*has_lpdead_items = (lpdead_items > 0);
@@ -1862,7 +2073,7 @@ lazy_vacuum(LVRelState *vacrel)
/* Should not end up here with no indexes */
Assert(vacrel->nindexes > 0);
- Assert(vacrel->lpdead_item_pages > 0);
+ Assert(vacrel->counters->lpdead_item_pages > 0);
if (!vacrel->do_index_vacuuming)
{
@@ -1896,7 +2107,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items_info->num_items);
+ Assert(vacrel->counters->lpdead_items == vacrel->dead_items_info->num_items);
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -1923,7 +2134,7 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
+ bypass = (vacrel->counters->lpdead_item_pages < threshold &&
(TidStoreMemoryUsage(vacrel->dead_items) < (32L * 1024L * 1024L)));
}
@@ -2061,7 +2272,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items_info->num_items == vacrel->lpdead_items);
+ vacrel->dead_items_info->num_items == vacrel->counters->lpdead_items);
Assert(allindexes || VacuumFailsafeActive);
/*
@@ -2165,8 +2376,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* the second heap pass. No more, no less.
*/
Assert(vacrel->num_index_scans > 1 ||
- (vacrel->dead_items_info->num_items == vacrel->lpdead_items &&
- vacuumed_pages == vacrel->lpdead_item_pages));
+ (vacrel->dead_items_info->num_items == vacrel->counters->lpdead_items &&
+ vacuumed_pages == vacrel->counters->lpdead_item_pages));
ereport(DEBUG2,
(errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
@@ -2347,7 +2558,7 @@ static void
lazy_cleanup_all_indexes(LVRelState *vacrel)
{
double reltuples = vacrel->new_rel_tuples;
- bool estimated_count = vacrel->scanned_pages < vacrel->rel_pages;
+ bool estimated_count = vacrel->counters->scanned_pages < vacrel->rel_pages;
const int progress_start_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_INDEXES_TOTAL
@@ -2528,7 +2739,7 @@ should_attempt_truncation(LVRelState *vacrel)
if (!vacrel->do_rel_truncate || VacuumFailsafeActive)
return false;
- possibly_freeable = vacrel->rel_pages - vacrel->nonempty_pages;
+ possibly_freeable = vacrel->rel_pages - vacrel->counters->nonempty_pages;
if (possibly_freeable > 0 &&
(possibly_freeable >= REL_TRUNCATE_MINIMUM ||
possibly_freeable >= vacrel->rel_pages / REL_TRUNCATE_FRACTION))
@@ -2554,7 +2765,7 @@ lazy_truncate_heap(LVRelState *vacrel)
/* Update error traceback information one last time */
update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_TRUNCATE,
- vacrel->nonempty_pages, InvalidOffsetNumber);
+ vacrel->counters->nonempty_pages, InvalidOffsetNumber);
/*
* Loop until no more truncating can be done.
@@ -2655,7 +2866,7 @@ lazy_truncate_heap(LVRelState *vacrel)
* without also touching reltuples, since the tuple count wasn't
* changed by the truncation.
*/
- vacrel->removed_pages += orig_rel_pages - new_rel_pages;
+ vacrel->counters->removed_pages += orig_rel_pages - new_rel_pages;
vacrel->rel_pages = new_rel_pages;
ereport(vacrel->verbose ? INFO : DEBUG2,
@@ -2663,7 +2874,7 @@ lazy_truncate_heap(LVRelState *vacrel)
vacrel->relname,
orig_rel_pages, new_rel_pages)));
orig_rel_pages = new_rel_pages;
- } while (new_rel_pages > vacrel->nonempty_pages && lock_waiter_detected);
+ } while (new_rel_pages > vacrel->counters->nonempty_pages && lock_waiter_detected);
}
/*
@@ -2691,7 +2902,7 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
StaticAssertStmt((PREFETCH_SIZE & (PREFETCH_SIZE - 1)) == 0,
"prefetch size must be power of 2");
prefetchedUntil = InvalidBlockNumber;
- while (blkno > vacrel->nonempty_pages)
+ while (blkno > vacrel->counters->nonempty_pages)
{
Buffer buf;
Page page;
@@ -2803,7 +3014,7 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
* pages still are; we need not bother to look at the last known-nonempty
* page.
*/
- return vacrel->nonempty_pages;
+ return vacrel->counters->nonempty_pages;
}
/*
@@ -2821,12 +3032,8 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
autovacuum_work_mem != -1 ?
autovacuum_work_mem : maintenance_work_mem;
- /*
- * Initialize state for a parallel vacuum. As of now, only one worker can
- * be used for an index, so we invoke parallelism only if there are at
- * least two indexes on a table.
- */
- if (nworkers >= 0 && vacrel->nindexes > 1 && vacrel->do_index_vacuuming)
+ /* Initialize state for a parallel vacuum */
+ if (nworkers >= 0)
{
/*
* Since parallel workers cannot access data in temporary tables, we
@@ -2844,11 +3051,18 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
vacrel->relname)));
}
else
+ {
+ /*
+ * For parallel index vacuuming, only one worker can be used for an
+ * index, so we invoke parallelism only if there are at least two
+ * indexes on a table.
+ */
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
vac_work_mem,
vacrel->verbose ? INFO : DEBUG2,
- vacrel->bstrategy);
+ vacrel->bstrategy, (void *) vacrel);
+ }
/*
* If parallel mode started, dead_items and dead_items_info spaces are
@@ -2889,9 +3103,19 @@ dead_items_add(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
};
int64 prog_val[2];
+ /*
+ * Protect both dead_items and dead_items_info from concurrent updates
+ * in parallel heap scan cases.
+ */
+ if (ParallelHeapVacuumIsActive(vacrel))
+ TidStoreLockExclusive(dead_items);
+
TidStoreSetBlockOffsets(dead_items, blkno, offsets, num_offsets);
vacrel->dead_items_info->num_items += num_offsets;
+ if (ParallelHeapVacuumIsActive(vacrel))
+ TidStoreUnlock(dead_items);
+
/* update the progress information */
prog_val[0] = vacrel->dead_items_info->num_items;
prog_val[1] = TidStoreMemoryUsage(dead_items);
@@ -3093,6 +3317,359 @@ update_relstats_all_indexes(LVRelState *vacrel)
}
}
+/*
+ * Compute the number of parallel workers for parallel vacuum heap scan.
+ *
+ * The calculation logic is borrowed from compute_parallel_worker().
+ */
+int
+heap_parallel_vacuum_compute_workers(Relation rel, int nrequested)
+{
+ int parallel_workers = 0;
+ int heap_parallel_threshold;
+ int heap_pages;
+
+ if (nrequested == 0)
+ {
+ /*
+ * Select the number of workers based on the log of the size of
+ * the relation. This probably needs to be a good deal more
+ * sophisticated, but we need something here for now. Note that
+ * the upper limit of the min_parallel_table_scan_size GUC is
+ * chosen to prevent overflow here.
+ */
+ heap_parallel_threshold = Max(min_parallel_table_scan_size, 1);
+ heap_pages = RelationGetNumberOfBlocks(rel);
+ while (heap_pages >= (BlockNumber) (heap_parallel_threshold * 3))
+ {
+ parallel_workers++;
+ heap_parallel_threshold *= 3;
+ if (heap_parallel_threshold > INT_MAX / 3)
+ break;
+ }
+ }
+ else
+ parallel_workers = nrequested;
+
+ return parallel_workers;
+}
+
+/*
+ * Compute the amount of space we'll need in the parallel heap vacuum
+ * DSM, and inform pcxt->estimator about our needs.
+ *
+ * nworkers is the number of workers for the table vacuum. Note that it could
+ * differ from pcxt->nworkers, since pcxt->nworkers is the maximum of the numbers
+ * of workers for table vacuum and index vacuum.
+ */
+void
+heap_parallel_vacuum_estimate(Relation rel, ParallelContext *pcxt,
+ int nworkers, void *state)
+{
+ Size size = 0;
+ LVRelState *vacrel = (LVRelState *) state;
+
+ /* space for PHVShared */
+ size = add_size(size, SizeOfPHVShared);
+ size = add_size(size, mul_size(sizeof(LVRelCounters), nworkers));
+ vacrel->shared_len = size;
+ shm_toc_estimate_chunk(&pcxt->estimator, size);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+ /* space for ParallelBlockTableScanDesc */
+ vacrel->pscan_len = table_block_parallelscan_estimate(rel);
+ shm_toc_estimate_chunk(&pcxt->estimator, vacrel->pscan_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+ /* space for per-worker scan state, PHVScanWorkerState */
+ vacrel->pscanwork_len = mul_size(sizeof(PHVScanWorkerState), nworkers);
+ shm_toc_estimate_chunk(&pcxt->estimator, vacrel->pscanwork_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/*
+ * Set up shared memory for parallel heap vacuum.
+ */
+void
+heap_parallel_vacuum_initialize(Relation rel, ParallelContext *pcxt,
+ int nworkers, void *state)
+{
+ LVRelState *vacrel = (LVRelState *) state;
+ ParallelBlockTableScanDesc pscan;
+ PHVScanWorkerState *pscanwork;
+ PHVShared *shared;
+ PHVState *phvstate;
+
+ phvstate = (PHVState *) palloc(sizeof(PHVState));
+
+ shared = shm_toc_allocate(pcxt->toc, vacrel->shared_len);
+
+ /* Prepare the shared information */
+
+ MemSet(shared, 0, vacrel->shared_len);
+ shared->aggressive = vacrel->aggressive;
+ shared->skipwithvm = vacrel->skipwithvm;
+ shared->cutoffs = vacrel->cutoffs;
+ shared->NewRelfrozenXid = vacrel->counters->NewRelfrozenXid;
+ shared->NewRelminMxid = vacrel->counters->NewRelminMxid;
+ shared->skippedallvis = vacrel->counters->skippedallvis;
+
+ /*
+ * XXX: we copy the contents of vistest to the shared area, but in order to do
+ * that, we need to either expose GlobalVisTest or provide functions to copy the
+ * contents of GlobalVisTest somewhere. Currently we do the former, but I'm not
+ * sure it's the best choice.
+ *
+ * An alternative idea is to have each worker determine the cutoff and have its
+ * own vistest. But we need to consider that carefully, since parallel workers
+ * would end up having different cutoffs and horizons.
+ */
+ shared->vistest = *vacrel->vistest;
+
+ shm_toc_insert(pcxt->toc, LV_PARALLEL_SCAN_SHARED, shared);
+
+ phvstate->shared = shared;
+
+ /* prepare the parallel block table scan description */
+ pscan = shm_toc_allocate(pcxt->toc, vacrel->pscan_len);
+ shm_toc_insert(pcxt->toc, LV_PARALLEL_SCAN_DESC, pscan);
+
+ /* initialize parallel scan description */
+ table_block_parallelscan_initialize(rel, (ParallelTableScanDesc) pscan);
+ phvstate->pscandesc = pscan;
+
+ /* prepare the workers' parallel block table scan state */
+ pscanwork = shm_toc_allocate(pcxt->toc, vacrel->pscanwork_len);
+ MemSet(pscanwork, 0, vacrel->pscanwork_len);
+ shm_toc_insert(pcxt->toc, LV_PARALLEL_SCAN_DESC_WORKER, pscanwork);
+ phvstate->scanstates = pscanwork;
+
+ vacrel->phvstate = phvstate;
+}
+
+/*
+ * Main function for parallel heap vacuum workers.
+ */
+void
+heap_parallel_vacuum_scan_worker(Relation rel, ParallelVacuumState *pvs,
+ ParallelWorkerContext *pwcxt)
+{
+ LVRelState vacrel = {0};
+ PHVState *phvstate;
+ PHVShared *shared;
+ ParallelBlockTableScanDesc pscandesc;
+ PHVScanWorkerState *scanstate;
+ LVRelCounters *counters;
+ bool scan_done;
+
+ phvstate = palloc(sizeof(PHVState));
+
+ pscandesc = (ParallelBlockTableScanDesc) shm_toc_lookup(pwcxt->toc,
+ LV_PARALLEL_SCAN_DESC,
+ false);
+ phvstate->pscandesc = pscandesc;
+
+ shared = (PHVShared *) shm_toc_lookup(pwcxt->toc, LV_PARALLEL_SCAN_SHARED,
+ false);
+ phvstate->shared = shared;
+
+ scanstate = (PHVScanWorkerState *) shm_toc_lookup(pwcxt->toc,
+ LV_PARALLEL_SCAN_DESC_WORKER,
+ false);
+
+ phvstate->myscanstate = &(scanstate[ParallelWorkerNumber]);
+ counters = &(shared->worker_relcnts[ParallelWorkerNumber]);
+
+ /* Prepare LVRelState */
+ vacrel.rel = rel;
+ vacrel.indrels = parallel_vacuum_get_table_indexes(pvs, &vacrel.nindexes);
+ vacrel.pvs = pvs;
+ vacrel.phvstate = phvstate;
+ vacrel.aggressive = shared->aggressive;
+ vacrel.skipwithvm = shared->skipwithvm;
+ vacrel.cutoffs = shared->cutoffs;
+ vacrel.vistest = &(shared->vistest);
+ vacrel.dead_items = parallel_vacuum_get_dead_items(pvs,
+ &vacrel.dead_items_info);
+ vacrel.rel_pages = RelationGetNumberOfBlocks(rel);
+ vacrel.counters = counters;
+
+ /* initialize per-worker relation statistics */
+ MemSet(counters, 0, sizeof(LVRelCounters));
+
+ vacrel.counters->NewRelfrozenXid = shared->NewRelfrozenXid;
+ vacrel.counters->NewRelminMxid = shared->NewRelminMxid;
+ vacrel.counters->skippedallvis = shared->skippedallvis;
+
+ /*
+ * XXX: the following fields are not set yet:
+ * - index vacuum related fields such as consider_bypass_optimization,
+ * do_index_vacuuming etc.
+ * - error reporting state.
+ * - statistics such as scanned_pages etc.
+ * - oldest extant XID/MXID.
+ * - states maintained by heap_vac_scan_next_block()
+ */
+
+ /* Initialize the start block, if not done yet */
+ if (!phvstate->myscanstate->maybe_have_blocks)
+ {
+ table_block_parallelscan_startblock_init(rel,
+ &(phvstate->myscanstate->state),
+ phvstate->pscandesc);
+
+ phvstate->myscanstate->maybe_have_blocks = false;
+ }
+
+ /*
+ * XXX: if we want to support parallel heap *vacuum*, we need to allow
+ * workers to call different function based on the shared information.
+ */
+ scan_done = do_lazy_scan_heap(&vacrel);
+
+ phvstate->myscanstate->maybe_have_blocks = !scan_done;
+}
+
+/*
+ * Complete parallel heap scans that have remaining blocks in their
+ * chunks.
+ */
+static void
+parallel_heap_complete_unfinished_scan(LVRelState *vacrel)
+{
+ int nworkers;
+
+ Assert(!IsParallelWorker());
+
+ nworkers = parallel_vacuum_get_nworkers_table(vacrel->pvs);
+
+ for (int i = 0; i < nworkers; i++)
+ {
+ PHVScanWorkerState *wstate = &(vacrel->phvstate->scanstates[i]);
+ bool scan_done PG_USED_FOR_ASSERTS_ONLY;
+
+ if (!wstate->maybe_have_blocks)
+ continue;
+
+ vacrel->phvstate->myscanstate = wstate;
+
+ scan_done = do_lazy_scan_heap(vacrel);
+
+ Assert(scan_done);
+ }
+}
+
+/*
+ * Accumulate relation counters that parallel workers collected into the
+ * leader's counters.
+ */
+static void
+parallel_heap_vacuum_gather_scan_stats(LVRelState *vacrel)
+{
+ PHVState *phvstate = vacrel->phvstate;
+
+ Assert(ParallelHeapVacuumIsActive(vacrel));
+
+ for (int i = 0; i < phvstate->nworkers_launched; i++)
+ {
+ LVRelCounters *counters = &(phvstate->shared->worker_relcnts[i]);
+
+#define LV_ACCUM_ITEM(item) (vacrel)->counters->item += (counters)->item
+
+ LV_ACCUM_ITEM(scanned_pages);
+ LV_ACCUM_ITEM(removed_pages);
+ LV_ACCUM_ITEM(frozen_pages);
+ LV_ACCUM_ITEM(lpdead_item_pages);
+ LV_ACCUM_ITEM(missed_dead_pages);
+ LV_ACCUM_ITEM(nonempty_pages);
+ LV_ACCUM_ITEM(tuples_deleted);
+ LV_ACCUM_ITEM(tuples_frozen);
+ LV_ACCUM_ITEM(lpdead_items);
+ LV_ACCUM_ITEM(live_tuples);
+ LV_ACCUM_ITEM(recently_dead_tuples);
+ LV_ACCUM_ITEM(missed_dead_tuples);
+
+#undef LV_ACCUM_ITEM
+
+ if (TransactionIdPrecedes(counters->NewRelfrozenXid, vacrel->counters->NewRelfrozenXid))
+ vacrel->counters->NewRelfrozenXid = counters->NewRelfrozenXid;
+
+ if (MultiXactIdPrecedesOrEquals(counters->NewRelminMxid, vacrel->counters->NewRelminMxid))
+ vacrel->counters->NewRelminMxid = counters->NewRelminMxid;
+
+ if (!vacrel->counters->skippedallvis && counters->skippedallvis)
+ vacrel->counters->skippedallvis = true;
+ }
+}
+
+/*
+ * A parallel variant of do_lazy_scan_heap(). The leader process launches parallel
+ * workers to scan the heap in parallel.
+ */
+static void
+do_parallel_lazy_scan_heap(LVRelState *vacrel)
+{
+ PHVScanWorkerState *scanstate;
+ TidStore *dead_items = vacrel->dead_items;
+ VacDeadItemsInfo *dead_items_info = vacrel->dead_items_info;
+
+ Assert(ParallelHeapVacuumIsActive(vacrel));
+ Assert(!IsParallelWorker());
+
+ /* launch parallel workers */
+ vacrel->phvstate->nworkers_launched = parallel_vacuum_table_scan_begin(vacrel->pvs);
+
+ /* initialize parallel scan description to join as a worker */
+ scanstate = palloc(sizeof(PHVScanWorkerState));
+ table_block_parallelscan_startblock_init(vacrel->rel, &(scanstate->state),
+ vacrel->phvstate->pscandesc);
+ vacrel->phvstate->myscanstate = scanstate;
+
+ for (;;)
+ {
+ bool scan_done PG_USED_FOR_ASSERTS_ONLY;
+
+ /*
+ * Scan the table until either we are close to overrunning the available
+ * space for dead_items TIDs or we reach the end of the table.
+ */
+ scan_done = do_lazy_scan_heap(vacrel);
+
+ /* stop parallel workers and gather the collected stats */
+ parallel_vacuum_table_scan_end(vacrel->pvs);
+ parallel_heap_vacuum_gather_scan_stats(vacrel);
+
+ /*
+ * If we are close to overrunning the available space for dead_items
+ * TIDs, do a cycle of index and heap vacuuming before relaunching the
+ * workers and resuming the scan.
+ */
+ if (TidStoreMemoryUsage(dead_items) > dead_items_info->max_bytes)
+ {
+ /* Perform a round of index and heap vacuuming */
+ vacrel->consider_bypass_optimization = false;
+ lazy_vacuum(vacrel);
+
+ /* Report that we are once again scanning the heap */
+ pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+ PROGRESS_VACUUM_PHASE_SCAN_HEAP);
+
+ /* re-launch parallel workers */
+ vacrel->phvstate->nworkers_launched =
+ parallel_vacuum_table_scan_begin(vacrel->pvs);
+
+ continue;
+ }
+
+ /* We reached the end of the table */
+ Assert(scan_done);
+ break;
+ }
+
+ parallel_heap_complete_unfinished_scan(vacrel);
+}
+
/*
* Error context callback for errors occurring during vacuum. The error
* context messages for index phases should match the messages set in parallel
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index f26070bff2..968addf94f 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -28,6 +28,7 @@
#include "access/amapi.h"
#include "access/table.h"
+#include "access/tableam.h"
#include "access/xact.h"
#include "commands/progress.h"
#include "commands/vacuum.h"
@@ -64,6 +65,12 @@ typedef struct PVShared
Oid relid;
int elevel;
+ /*
+ * True if the caller wants parallel workers to invoke the vacuum table
+ * scan callback.
+ */
+ bool do_vacuum_table_scan;
+
/*
* Fields for both index vacuum and cleanup.
*
@@ -163,6 +170,9 @@ struct ParallelVacuumState
/* NULL for worker processes */
ParallelContext *pcxt;
+ /* Passed to parallel table scan workers. NULL for leader process */
+ ParallelWorkerContext *pwcxt;
+
/* Parent Heap Relation */
Relation heaprel;
@@ -192,6 +202,16 @@ struct ParallelVacuumState
/* Points to WAL usage area in DSM */
WalUsage *wal_usage;
+ /*
+ * The number of workers for parallel table scan/vacuuming and index vacuuming,
+ * respectively.
+ */
+ int nworkers_for_table;
+ int nworkers_for_index;
+
+ /* How many times has the parallel table vacuum scan been called? */
+ int num_table_scans;
+
/*
* False if the index is totally unsuitable target for all parallel
* processing. For example, the index could be <
@@ -220,8 +240,9 @@ struct ParallelVacuumState
PVIndVacStatus status;
};
-static int parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested,
- bool *will_parallel_vacuum);
+static void parallel_vacuum_compute_workers(Relation rel, Relation *indrels, int nindexes,
+ int nrequested, int *nworkers_table,
+ int *nworkers_index, bool *will_parallel_vacuum);
static void parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scans,
bool vacuum);
static void parallel_vacuum_process_safe_indexes(ParallelVacuumState *pvs);
@@ -241,7 +262,7 @@ static void parallel_vacuum_error_callback(void *arg);
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
int nrequested_workers, int vac_work_mem,
- int elevel, BufferAccessStrategy bstrategy)
+ int elevel, BufferAccessStrategy bstrategy, void *state)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
@@ -255,6 +276,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
Size est_shared_len;
int nindexes_mwm = 0;
int parallel_workers = 0;
+ int nworkers_table;
+ int nworkers_index;
int querylen;
/*
@@ -262,15 +285,17 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
* relation
*/
Assert(nrequested_workers >= 0);
- Assert(nindexes > 0);
/*
* Compute the number of parallel vacuum workers to launch
*/
will_parallel_vacuum = (bool *) palloc0(sizeof(bool) * nindexes);
- parallel_workers = parallel_vacuum_compute_workers(indrels, nindexes,
- nrequested_workers,
- will_parallel_vacuum);
+ parallel_vacuum_compute_workers(rel, indrels, nindexes, nrequested_workers,
+ &nworkers_table, &nworkers_index,
+ will_parallel_vacuum);
+
+ parallel_workers = Max(nworkers_table, nworkers_index);
+
if (parallel_workers <= 0)
{
/* Can't perform vacuum in parallel -- return NULL */
@@ -284,6 +309,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
pvs->will_parallel_vacuum = will_parallel_vacuum;
pvs->bstrategy = bstrategy;
pvs->heaprel = rel;
+ pvs->nworkers_for_table = nworkers_table;
+ pvs->nworkers_for_index = nworkers_index;
EnterParallelMode();
pcxt = CreateParallelContext("postgres", "parallel_vacuum_main",
@@ -326,6 +353,10 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
else
querylen = 0; /* keep compiler quiet */
+ /* Estimate AM-specific space for parallel table vacuum */
+ if (nworkers_table > 0)
+ table_parallel_vacuum_estimate(rel, pcxt, nworkers_table, state);
+
InitializeParallelDSM(pcxt);
/* Prepare index vacuum stats */
@@ -417,6 +448,10 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
PARALLEL_VACUUM_KEY_QUERY_TEXT, sharedquery);
}
+ /* Prepare AM-specific DSM for parallel table vacuum */
+ if (nworkers_table > 0)
+ table_parallel_vacuum_initialize(rel, pcxt, nworkers_table, state);
+
/* Success -- return parallel vacuum state */
return pvs;
}
@@ -538,27 +573,41 @@ parallel_vacuum_cleanup_all_indexes(ParallelVacuumState *pvs, long num_table_tup
* min_parallel_index_scan_size as invoking workers for very small indexes
* can hurt performance.
*
+ * XXX needs to mention the number of workers for the table.
+ *
* nrequested is the number of parallel workers that user requested. If
* nrequested is 0, we compute the parallel degree based on nindexes, that is
* the number of indexes that support parallel vacuum. This function also
* sets will_parallel_vacuum to remember indexes that participate in parallel
* vacuum.
*/
-static int
-parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested,
- bool *will_parallel_vacuum)
+static void
+parallel_vacuum_compute_workers(Relation rel, Relation *indrels, int nindexes,
+ int nrequested, int *nworkers_table,
+ int *nworkers_index, bool *will_parallel_vacuum)
{
int nindexes_parallel = 0;
int nindexes_parallel_bulkdel = 0;
int nindexes_parallel_cleanup = 0;
- int parallel_workers;
+ int parallel_workers_table = 0;
+ int parallel_workers_index = 0;
+
+ *nworkers_table = 0;
+ *nworkers_index = 0;
/*
* We don't allow performing parallel operation in standalone backend or
* when parallelism is disabled.
*/
if (!IsUnderPostmaster || max_parallel_maintenance_workers == 0)
- return 0;
+ return;
+
+ /*
+ * Compute the number of workers for parallel table scan. Cap by
+ * max_parallel_maintenance_workers.
+ */
+ parallel_workers_table = Min(table_parallel_vacuum_compute_workers(rel, nrequested),
+ max_parallel_maintenance_workers);
/*
* Compute the number of indexes that can participate in parallel vacuum.
@@ -589,17 +638,18 @@ parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested,
nindexes_parallel--;
/* No index supports parallel vacuum */
- if (nindexes_parallel <= 0)
- return 0;
-
- /* Compute the parallel degree */
- parallel_workers = (nrequested > 0) ?
- Min(nrequested, nindexes_parallel) : nindexes_parallel;
+ if (nindexes_parallel > 0)
+ {
+ /* Compute the parallel degree for parallel index vacuum */
+ parallel_workers_index = (nrequested > 0) ?
+ Min(nrequested, nindexes_parallel) : nindexes_parallel;
- /* Cap by max_parallel_maintenance_workers */
- parallel_workers = Min(parallel_workers, max_parallel_maintenance_workers);
+ /* Cap by max_parallel_maintenance_workers */
+ parallel_workers_index = Min(parallel_workers_index, max_parallel_maintenance_workers);
+ }
- return parallel_workers;
+ *nworkers_table = parallel_workers_table;
+ *nworkers_index = parallel_workers_index;
}
/*
@@ -669,7 +719,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
if (nworkers > 0)
{
/* Reinitialize parallel context to relaunch parallel workers */
- if (num_index_scans > 0)
+ if (num_index_scans > 0 || pvs->num_table_scans > 0)
ReinitializeParallelDSM(pvs->pcxt);
/*
@@ -978,6 +1028,120 @@ parallel_vacuum_index_is_parallel_safe(Relation indrel, int num_index_scans,
return true;
}
+/*
+ * A parallel worker invokes the table-AM-specific vacuum scan callback.
+ */
+static void
+parallel_vacuum_process_table(ParallelVacuumState *pvs)
+{
+ /*
+ * Increment the active worker count if we are able to launch any worker.
+ */
+ if (VacuumActiveNWorkers)
+ pg_atomic_add_fetch_u32(VacuumActiveNWorkers, 1);
+
+ /* Do table vacuum scan */
+ table_parallel_vacuum_scan(pvs->heaprel, pvs, pvs->pwcxt);
+
+ /*
+ * We have completed the table vacuum so decrement the active worker
+ * count.
+ */
+ if (VacuumActiveNWorkers)
+ pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1);
+}
+
+/*
+ * Prepare DSM and vacuum delay, and launch parallel workers for parallel
+ * table vacuum scan.
+ */
+int
+parallel_vacuum_table_scan_begin(ParallelVacuumState *pvs)
+{
+ Assert(!IsParallelWorker());
+
+ if (pvs->nworkers_for_table == 0)
+ return 0;
+
+ pg_atomic_write_u32(&(pvs->shared->cost_balance), VacuumCostBalance);
+ pg_atomic_write_u32(&(pvs->shared->active_nworkers), 0);
+
+ pvs->shared->do_vacuum_table_scan = true;
+
+ if (pvs->num_table_scans > 0)
+ ReinitializeParallelDSM(pvs->pcxt);
+
+ ReinitializeParallelWorkers(pvs->pcxt, pvs->nworkers_for_table);
+
+ LaunchParallelWorkers(pvs->pcxt);
+
+ if (pvs->pcxt->nworkers_launched > 0)
+ {
+ /*
+ * Reset the local cost values for leader backend as we have
+ * already accumulated the remaining balance of heap.
+ */
+ VacuumCostBalance = 0;
+ VacuumCostBalanceLocal = 0;
+
+ /* Enable shared cost balance for leader backend */
+ VacuumSharedCostBalance = &(pvs->shared->cost_balance);
+ VacuumActiveNWorkers = &(pvs->shared->active_nworkers);
+ }
+
+ ereport(pvs->shared->elevel,
+ (errmsg(ngettext("launched %d parallel vacuum worker for table scanning (planned: %d)",
+ "launched %d parallel vacuum workers for table scanning (planned: %d)",
+ pvs->pcxt->nworkers_launched),
+ pvs->pcxt->nworkers_launched, pvs->nworkers_for_table)));
+
+ return pvs->pcxt->nworkers_launched;
+}
+
+/*
+ * Wait for all parallel table vacuum scan workers to finish, and gather statistics.
+ */
+void
+parallel_vacuum_table_scan_end(ParallelVacuumState *pvs)
+{
+ Assert(!IsParallelWorker());
+
+ if (pvs->nworkers_for_table == 0)
+ return;
+
+ WaitForParallelWorkersToFinish(pvs->pcxt);
+
+ for (int i = 0; i < pvs->pcxt->nworkers_launched; i++)
+ InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]);
+
+ /*
+ * Carry the shared balance value to heap scan and disable shared costing
+ */
+ if (VacuumSharedCostBalance)
+ {
+ VacuumCostBalance = pg_atomic_read_u32(VacuumSharedCostBalance);
+ VacuumSharedCostBalance = NULL;
+ VacuumActiveNWorkers = NULL;
+ }
+
+ pvs->shared->do_vacuum_table_scan = false;
+ pvs->num_table_scans++;
+}
+
+Relation *
+parallel_vacuum_get_table_indexes(ParallelVacuumState *pvs, int *nindexes)
+{
+ *nindexes = pvs->nindexes;
+
+ return pvs->indrels;
+}
+
+int
+parallel_vacuum_get_nworkers_table(ParallelVacuumState *pvs)
+{
+ return pvs->nworkers_for_table;
+}
+
/*
* Perform work within a launched parallel process.
*
@@ -1026,7 +1190,6 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
* matched to the leader's one.
*/
vac_open_indexes(rel, RowExclusiveLock, &nindexes, &indrels);
- Assert(nindexes > 0);
if (shared->maintenance_work_mem_worker > 0)
maintenance_work_mem = shared->maintenance_work_mem_worker;
@@ -1060,6 +1223,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
pvs.relname = pstrdup(RelationGetRelationName(rel));
pvs.heaprel = rel;
+ pvs.pwcxt = palloc(sizeof(ParallelWorkerContext));
+ pvs.pwcxt->toc = toc;
+ pvs.pwcxt->seg = seg;
+
/* These fields will be filled during index vacuum or cleanup */
pvs.indname = NULL;
pvs.status = PARALLEL_INDVAC_STATUS_INITIAL;
@@ -1077,8 +1244,15 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Prepare to track buffer usage during parallel execution */
InstrStartParallelQuery();
- /* Process indexes to perform vacuum/cleanup */
- parallel_vacuum_process_safe_indexes(&pvs);
+ if (pvs.shared->do_vacuum_table_scan)
+ {
+ parallel_vacuum_process_table(&pvs);
+ }
+ else
+ {
+ /* Process indexes to perform vacuum/cleanup */
+ parallel_vacuum_process_safe_indexes(&pvs);
+ }
/* Report buffer/WAL usage during parallel execution */
buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index d5165aa0d9..37035cc186 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -164,15 +164,6 @@ typedef struct ProcArrayStruct
*
* The typedef is in the header.
*/
-struct GlobalVisState
-{
- /* XIDs >= are considered running by some backend */
- FullTransactionId definitely_needed;
-
- /* XIDs < are not considered to be running by any backend */
- FullTransactionId maybe_needed;
-};
-
/*
* Result of ComputeXidHorizons().
*/
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 9e9aec88a6..6c5e48e478 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -20,6 +20,7 @@
#include "access/skey.h"
#include "access/table.h" /* for backward compatibility */
#include "access/tableam.h"
+#include "commands/vacuum.h"
#include "nodes/lockoptions.h"
#include "nodes/primnodes.h"
#include "storage/bufpage.h"
@@ -393,6 +394,13 @@ extern void log_heap_prune_and_freeze(Relation relation, Buffer buffer,
struct VacuumParams;
extern void heap_vacuum_rel(Relation rel,
struct VacuumParams *params, BufferAccessStrategy bstrategy);
+extern int heap_parallel_vacuum_compute_workers(Relation rel, int requested);
+extern void heap_parallel_vacuum_estimate(Relation rel, ParallelContext *pcxt,
+ int nworkers, void *state);
+extern void heap_parallel_vacuum_initialize(Relation rel, ParallelContext *pcxt,
+ int nworkers, void *state);
+extern void heap_parallel_vacuum_scan_worker(Relation rel, ParallelVacuumState *pvs,
+ ParallelWorkerContext *pwcxt);
/* in heap/heapam_visibility.c */
extern bool HeapTupleSatisfiesVisibility(HeapTuple htup, Snapshot snapshot,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8e583b45cd..b10b047ca1 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -20,6 +20,7 @@
#include "access/relscan.h"
#include "access/sdir.h"
#include "access/xact.h"
+#include "commands/vacuum.h"
#include "executor/tuptable.h"
#include "storage/read_stream.h"
#include "utils/rel.h"
@@ -655,6 +656,46 @@ typedef struct TableAmRoutine
struct VacuumParams *params,
BufferAccessStrategy bstrategy);
+ /* ------------------------------------------------------------------------
+ * Callbacks for parallel table vacuum.
+ * ------------------------------------------------------------------------
+ */
+
+ /*
+ * Compute the number of parallel workers for parallel table vacuum.
+ * The function must return 0 to disable parallel table vacuum.
+ */
+ int (*parallel_vacuum_compute_workers) (Relation rel, int requested);
+
+ /*
+ * Compute the amount of DSM space the AM needs for parallel table vacuum.
+ *
+ * Not called if parallel table vacuum is disabled.
+ */
+ void (*parallel_vacuum_estimate) (Relation rel,
+ ParallelContext *pcxt,
+ int nworkers,
+ void *state);
+
+ /*
+ * Initialize DSM space for parallel table vacuum.
+ *
+ * Not called if parallel table vacuum is disabled.
+ */
+ void (*parallel_vacuum_initialize) (Relation rel,
+ ParallelContext *pctx,
+ int nworkers,
+ void *state);
+
+ /*
+ * This callback is called for parallel table vacuum workers.
+ *
+ * Not called if parallel table vacuum is disabled.
+ */
+ void (*parallel_vacuum_scan_worker) (Relation rel,
+ ParallelVacuumState *pvs,
+ ParallelWorkerContext *pwcxt);
+
/*
* Prepare to analyze the next block in the read stream. Returns false if
* the stream is exhausted and true otherwise. The scan must have been
@@ -1720,6 +1761,33 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
rel->rd_tableam->relation_vacuum(rel, params, bstrategy);
}
+static inline int
+table_parallel_vacuum_compute_workers(Relation rel, int requested)
+{
+ return rel->rd_tableam->parallel_vacuum_compute_workers(rel, requested);
+}
+
+static inline void
+table_parallel_vacuum_estimate(Relation rel, ParallelContext *pcxt, int nworkers,
+ void *state)
+{
+ rel->rd_tableam->parallel_vacuum_estimate(rel, pcxt, nworkers, state);
+}
+
+static inline void
+table_parallel_vacuum_initialize(Relation rel, ParallelContext *pcxt, int nworkers,
+ void *state)
+{
+ rel->rd_tableam->parallel_vacuum_initialize(rel, pcxt, nworkers, state);
+}
+
+static inline void
+table_parallel_vacuum_scan(Relation rel, ParallelVacuumState *pvs,
+ ParallelWorkerContext *pwcxt)
+{
+ rel->rd_tableam->parallel_vacuum_scan_worker(rel, pvs, pwcxt);
+}
+
/*
* Prepare to analyze the next block in the read stream. The scan needs to
* have been started with table_beginscan_analyze(). Note that this routine
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 759f9a87d3..598bb5218f 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -360,7 +360,8 @@ extern void VacuumUpdateCosts(void);
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
int vac_work_mem, int elevel,
- BufferAccessStrategy bstrategy);
+ BufferAccessStrategy bstrategy,
+ void *state);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs,
VacDeadItemsInfo **dead_items_info_p);
@@ -372,6 +373,10 @@ extern void parallel_vacuum_cleanup_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans,
bool estimated_count);
+extern int parallel_vacuum_table_scan_begin(ParallelVacuumState *pvs);
+extern void parallel_vacuum_table_scan_end(ParallelVacuumState *pvs);
+extern int parallel_vacuum_get_nworkers_table(ParallelVacuumState *pvs);
+extern Relation *parallel_vacuum_get_table_indexes(ParallelVacuumState *pvs, int *nindexes);
extern void parallel_vacuum_main(dsm_segment *seg, shm_toc *toc);
/* in commands/analyze.c */
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 9398a84051..6ccb19a29f 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -102,8 +102,20 @@ extern char *ExportSnapshot(Snapshot snapshot);
/*
* These live in procarray.c because they're intimately linked to the
* procarray contents, but thematically they better fit into snapmgr.h.
+ *
+ * XXX the struct definition is temporarily moved from procarray.c for
+ * parallel table vacuum development. We need to find a suitable way for
+ * parallel table vacuum workers to share the GlobalVisState.
*/
-typedef struct GlobalVisState GlobalVisState;
+typedef struct GlobalVisState
+{
+ /* XIDs >= are considered running by some backend */
+ FullTransactionId definitely_needed;
+
+ /* XIDs < are not considered to be running by any backend */
+ FullTransactionId maybe_needed;
+} GlobalVisState;
+
extern GlobalVisState *GlobalVisTestFor(Relation rel);
extern bool GlobalVisTestIsRemovableXid(GlobalVisState *state, TransactionId xid);
extern bool GlobalVisTestIsRemovableFullXid(GlobalVisState *state, FullTransactionId fxid);
On Fri, Jun 28, 2024 at 9:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
# Benchmark results
* Test-1: parallel heap scan on the table without indexes

I created a 20GB table, made garbage on the table, and ran vacuum while
changing the parallel degree:

create unlogged table test (a int) with (autovacuum_enabled = off);
insert into test select generate_series(1, 600000000); --- 20GB table
delete from test where a % 5 = 0;
vacuum (verbose, parallel 0) test;

Here are the results (total time and heap scan time):

PARALLEL 0: 21.99 s (single process)
PARALLEL 1: 11.39 s
PARALLEL 2: 8.36 s
PARALLEL 3: 6.14 s
PARALLEL 4: 5.08 s

* Test-2: parallel heap scan on the table with one index

I used a table similar to the one in test-1 but created one btree index on it:

create unlogged table test (a int) with (autovacuum_enabled = off);
insert into test select generate_series(1, 600000000); --- 20GB table
create index on test (a);
delete from test where a % 5 = 0;
vacuum (verbose, parallel 0) test;

I've measured the total execution time as well as the time of each
vacuum phase (from left: heap scan time, index vacuum time, and heap
vacuum time):

PARALLEL 0: 45.11 s (21.89, 16.74, 6.48)
PARALLEL 1: 42.13 s (12.75, 22.04, 7.23)
PARALLEL 2: 39.27 s (8.93, 22.78, 7.45)
PARALLEL 3: 36.53 s (6.76, 22.00, 7.65)
PARALLEL 4: 35.84 s (5.85, 22.04, 7.83)

Overall, I can see that the parallel heap scan in lazy vacuum has decent
scalability; in both test-1 and test-2, the execution time of the heap
scan got ~4x faster with 4 parallel workers. On the other hand, when
it comes to the total vacuum execution time, I could not see much
performance improvement in test-2 (45.11 vs. 35.84). Looking at the
results of PARALLEL 0 vs. PARALLEL 1 in test-2, the heap scan got faster
(21.89 vs. 12.75) whereas index vacuum got slower (16.74 vs. 22.04),
and the heap scan in test-2 was not as fast as in test-1 with 1 parallel
worker (12.75 vs. 11.39).

I think the reason is that the shared TidStore is not very scalable, since
we have a single lock on it. In all cases in test-1, we don't use the
shared TidStore since all dead tuples are removed during heap pruning,
so the scalability was better overall than in test-2. In the PARALLEL 0
case in test-2 we use the local TidStore, and from a parallel degree of
1 in test-2 we use the shared TidStore, which the parallel workers
update concurrently. Also, I guess that the lookup performance of the
local TidStore is better than the shared TidStore's because of the
differences between a bump context and a DSA area. I think this
difference contributed to index vacuuming getting slower (16.74 vs. 22.04).

There are two obvious ideas to improve the overall vacuum execution
time: (1) improve the shared TidStore's scalability and (2) support
parallel heap vacuum. For (1), several ideas are proposed by the ART
authors[1]. I've not tried these ideas, but they might be applicable to
our ART implementation. Still, I prefer to start with (2) since it
would be easier. Feedback is very welcome.
Starting with (2) sounds like a reasonable approach. We should study a
few more things: (a) the performance results when there are 3-4
indexes, and (b) the reason for the performance improvement seen with
heap scans alone. We normally get the benefits of parallelism from
using multiple CPUs, but parallelizing scans (I/O) shouldn't give much
benefit. Is it possible that you are seeing benefits because most of
the data is either in shared_buffers or in memory? We can probably try
vacuuming the tables after restarting the nodes, to ensure the data is
not in memory.
--
With Regards,
Amit Kapila.
Dear Sawada-san,
The parallel vacuum we have today supports only for index vacuuming.
Therefore, while multiple workers can work on different indexes in
parallel, the heap table is always processed by the single process.
I'd like to propose $subject, which enables us to have multiple
workers running on the single heap table. This would be helpful to
speedup vacuuming for tables without indexes or tables with
INDEX_CLENAUP = off.
Sounds great. IIUC, vacuuming is still one of the main weak points of postgres.
I've attached a PoC patch for this feature. It implements only
parallel heap scans in lazyvacum. We can extend this feature to
support parallel heap vacuum as well in the future or in the same
patch.
Before diving in deep, I tested your PoC but found an unclear point.
When vacuuming is requested with parallel > 0, with almost the same workload
as yours, only the first page was scanned and cleaned up.
When parallel was set to zero, I got:
```
INFO: vacuuming "postgres.public.test"
INFO: finished vacuuming "postgres.public.test": index scans: 0
pages: 0 removed, 2654868 remain, 2654868 scanned (100.00% of total)
tuples: 120000000 removed, 480000000 remain, 0 are dead but not yet removable
removable cutoff: 752, which was 0 XIDs old when operation ended
new relfrozenxid: 739, which is 1 XIDs ahead of previous value
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 344.639 MB/s, avg write rate: 344.650 MB/s
buffer usage: 2655045 hits, 2655527 misses, 2655606 dirtied
WAL usage: 1 records, 1 full page images, 937 bytes
system usage: CPU: user: 39.45 s, system: 20.74 s, elapsed: 60.19 s
```
This means that all pages were indeed scanned and the dead tuples were removed.
However, when parallel was set to one, I got another result:
```
INFO: vacuuming "postgres.public.test"
INFO: launched 1 parallel vacuum worker for table scanning (planned: 1)
INFO: finished vacuuming "postgres.public.test": index scans: 0
pages: 0 removed, 2654868 remain, 1 scanned (0.00% of total)
tuples: 12 removed, 0 remain, 0 are dead but not yet removable
removable cutoff: 752, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 92.952 MB/s, avg write rate: 0.845 MB/s
buffer usage: 96 hits, 660 misses, 6 dirtied
WAL usage: 1 records, 1 full page images, 937 bytes
system usage: CPU: user: 0.05 s, system: 0.00 s, elapsed: 0.05 s
```
It looks like only one page was scanned and 12 tuples were removed,
which seems very strange to me...
The attached script emulates my test. IIUC it is almost the same as yours, but
the instance was restarted before vacuuming.
Can you reproduce this and see the reason? I can provide further
information on request.
Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
Attachments:
On Fri, Jul 5, 2024 at 6:51 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Dear Sawada-san,

Before diving in deep, I tested your PoC but found an unclear point.
When vacuuming is requested with parallel > 0, with almost the same workload
as yours, only the first page was scanned and cleaned up.
[...]
Can you reproduce this and see the reason? I can provide further
information on request.
Thank you for the test!
I could reproduce this issue and it's a bug; it skipped even
non-all-visible pages. I've attached a new version of the patch.
BTW, since we compute the number of parallel workers for the heap scan
based on the table size, it's possible that we launch multiple workers
even if most blocks are all-visible. It would be better to calculate it
based on (relpages - relallvisible), as sketched below.
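A compute-workers callback based on that idea might look roughly like the
following. This is only a sketch: the function name and the
pages-per-worker constant are placeholders, and relallvisible is just an
estimate maintained by previous VACUUM/ANALYZE runs.

```c
#define PAGES_PER_WORKER	((BlockNumber) 8192)	/* placeholder heuristic */

static int
heap_parallel_vacuum_compute_workers_idea(Relation rel, int requested)
{
	BlockNumber rel_pages = RelationGetNumberOfBlocks(rel);
	BlockNumber allvisible = (BlockNumber) rel->rd_rel->relallvisible;
	BlockNumber pages_to_scan;

	if (requested > 0)
		return requested;

	/* relallvisible can lag behind or exceed the current relation size */
	pages_to_scan = (rel_pages > allvisible) ? rel_pages - allvisible : 0;

	/* one worker per PAGES_PER_WORKER not-all-visible pages */
	return Min((int) (pages_to_scan / PAGES_PER_WORKER),
			   max_parallel_maintenance_workers);
}
```

Whatever heuristic we choose, the result is still capped by
max_parallel_maintenance_workers in parallel_vacuum_compute_workers(), as
in the current patch.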
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
parallel_heap_vacuum_scan_v2.patch
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 6f8b1b7929..cf8c6614cd 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2630,6 +2630,12 @@ static const TableAmRoutine heapam_methods = {
.relation_copy_data = heapam_relation_copy_data,
.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
.relation_vacuum = heap_vacuum_rel,
+
+ .parallel_vacuum_compute_workers = heap_parallel_vacuum_compute_workers,
+ .parallel_vacuum_estimate = heap_parallel_vacuum_estimate,
+ .parallel_vacuum_initialize = heap_parallel_vacuum_initialize,
+ .parallel_vacuum_scan_worker = heap_parallel_vacuum_scan_worker,
+
.scan_analyze_next_block = heapam_scan_analyze_next_block,
.scan_analyze_next_tuple = heapam_scan_analyze_next_tuple,
.index_build_range_scan = heapam_index_build_range_scan,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 3f88cf1e8e..ca44d04e66 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -49,6 +49,7 @@
#include "common/int.h"
#include "executor/instrument.h"
#include "miscadmin.h"
+#include "optimizer/paths.h"
#include "pgstat.h"
#include "portability/instr_time.h"
#include "postmaster/autovacuum.h"
@@ -117,10 +118,22 @@
#define PREFETCH_SIZE ((BlockNumber) 32)
/*
- * Macro to check if we are in a parallel vacuum. If true, we are in the
- * parallel mode and the DSM segment is initialized.
+ * DSM keys for heap parallel vacuum scan. Unlike other parallel execution code,
+ * we don't need to worry about DSM keys conflicting with plan_node_id, but need to
+ * avoid conflicting with DSM keys used in vacuumparallel.c.
+ */
+#define LV_PARALLEL_SCAN_SHARED 0xFFFF0001
+#define LV_PARALLEL_SCAN_DESC 0xFFFF0002
+#define LV_PARALLEL_SCAN_DESC_WORKER 0xFFFF0003
+
+/*
+ * Macro to check if we are in a parallel vacuum. If ParallelVacuumIsActive() is
+ * true, we are in the parallel mode, meaning that we do either parallel index
+ * vacuuming or parallel table vacuuming, or both. If ParallelHeapVacuumIsActive()
+ * is true, we do at least parallel table vacuuming.
*/
#define ParallelVacuumIsActive(vacrel) ((vacrel)->pvs != NULL)
+#define ParallelHeapVacuumIsActive(vacrel) ((vacrel)->phvstate != NULL)
/* Phases of vacuum during which we report error context. */
typedef enum
@@ -133,6 +146,80 @@ typedef enum
VACUUM_ERRCB_PHASE_TRUNCATE,
} VacErrPhase;
+/*
+ * Relation statistics collected during heap scanning and need to be shared among
+ * parallel vacuum workers.
+ */
+typedef struct LVRelCounters
+{
+ BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
+ BlockNumber removed_pages; /* # pages removed by relation truncation */
+ BlockNumber frozen_pages; /* # pages with newly frozen tuples */
+ BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
+ BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
+ BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
+
+ /* Counters that follow are only for scanned_pages */
+ int64 tuples_deleted; /* # deleted from table */
+ int64 tuples_frozen; /* # newly frozen */
+ int64 lpdead_items; /* # deleted from indexes */
+ int64 live_tuples; /* # live tuples remaining */
+ int64 recently_dead_tuples; /* # dead, but not yet removable */
+ int64 missed_dead_tuples; /* # removable, but not removed */
+
+ /* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid. */
+ TransactionId NewRelfrozenXid;
+ MultiXactId NewRelminMxid;
+ bool skippedallvis;
+} LVRelCounters;
+
+/*
+ * Struct for information that needs to be shared among parallel vacuum workers
+ */
+typedef struct PHVShared
+{
+ bool aggressive;
+ bool skipwithvm;
+
+ /* The initial values shared by the leader process */
+ TransactionId NewRelfrozenXid;
+ MultiXactId NewRelminMxid;
+ bool skippedallvis;
+
+ /* VACUUM operation's cutoffs for freezing and pruning */
+ struct VacuumCutoffs cutoffs;
+ GlobalVisState vistest;
+
+ LVRelCounters worker_relcnts[FLEXIBLE_ARRAY_MEMBER];
+} PHVShared;
+#define SizeOfPHVShared (offsetof(PHVShared, worker_relcnts))
+
+/* Per-worker scan state */
+typedef struct PHVScanWorkerState
+{
+ ParallelBlockTableScanWorkerData state;
+ bool maybe_have_blocks;
+} PHVScanWorkerState;
+
+/* Struct for parallel heap vacuum */
+typedef struct PHVState
+{
+ /* Parallel scan description shared among parallel workers */
+ ParallelBlockTableScanDesc pscandesc;
+
+ /* Shared information */
+ PHVShared *shared;
+
+ /* Per-worker scan state */
+ PHVScanWorkerState *myscanstate;
+
+ /* Points to all per-worker scan state array */
+ PHVScanWorkerState *scanstates;
+
+ /* The number of workers launched for parallel heap vacuum */
+ int nworkers_launched;
+} PHVState;
+
typedef struct LVRelState
{
/* Target heap relation and its indexes */
@@ -144,6 +231,12 @@ typedef struct LVRelState
BufferAccessStrategy bstrategy;
ParallelVacuumState *pvs;
+ /* Parallel heap vacuum state and sizes for each struct */
+ PHVState *phvstate;
+ Size pscan_len;
+ Size shared_len;
+ Size pscanwork_len;
+
/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
bool aggressive;
/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
@@ -159,10 +252,6 @@ typedef struct LVRelState
/* VACUUM operation's cutoffs for freezing and pruning */
struct VacuumCutoffs cutoffs;
GlobalVisState *vistest;
- /* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
- TransactionId NewRelfrozenXid;
- MultiXactId NewRelminMxid;
- bool skippedallvis;
/* Error reporting state */
char *dbname;
@@ -188,12 +277,10 @@ typedef struct LVRelState
VacDeadItemsInfo *dead_items_info;
BlockNumber rel_pages; /* total number of pages */
- BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
- BlockNumber removed_pages; /* # pages removed by relation truncation */
- BlockNumber frozen_pages; /* # pages with newly frozen tuples */
- BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
- BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
- BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
+ BlockNumber next_fsm_block_to_vacuum;
+
+ /* Block and tuple counters for the relation */
+ LVRelCounters *counters;
/* Statistics output by us, for table */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -203,13 +290,6 @@ typedef struct LVRelState
/* Instrumentation counters */
int num_index_scans;
- /* Counters that follow are only for scanned_pages */
- int64 tuples_deleted; /* # deleted from table */
- int64 tuples_frozen; /* # newly frozen */
- int64 lpdead_items; /* # deleted from indexes */
- int64 live_tuples; /* # live tuples remaining */
- int64 recently_dead_tuples; /* # dead, but not yet removable */
- int64 missed_dead_tuples; /* # removable, but not removed */
/* State maintained by heap_vac_scan_next_block() */
BlockNumber current_block; /* last block returned */
@@ -229,6 +309,7 @@ typedef struct LVSavedErrInfo
/* non-export function prototypes */
static void lazy_scan_heap(LVRelState *vacrel);
+static bool do_lazy_scan_heap(LVRelState *vacrel);
static bool heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
bool *all_visible_according_to_vm);
static void find_next_unskippable_block(LVRelState *vacrel, bool *skipsallvis);
@@ -271,6 +352,12 @@ static void dead_items_cleanup(LVRelState *vacrel);
static bool heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
TransactionId *visibility_cutoff_xid, bool *all_frozen);
static void update_relstats_all_indexes(LVRelState *vacrel);
+
+
+static void do_parallel_lazy_scan_heap(LVRelState *vacrel);
+static void parallel_heap_vacuum_gather_scan_stats(LVRelState *vacrel);
+static void parallel_heap_complete_unfinished_scan(LVRelState *vacrel);
+
static void vacuum_error_callback(void *arg);
static void update_vacuum_error_info(LVRelState *vacrel,
LVSavedErrInfo *saved_vacrel,
@@ -296,6 +383,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
BufferAccessStrategy bstrategy)
{
LVRelState *vacrel;
+ LVRelCounters *counters;
bool verbose,
instrument,
skipwithvm,
@@ -406,14 +494,28 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
Assert(params->index_cleanup == VACOPTVALUE_AUTO);
}
+ vacrel->next_fsm_block_to_vacuum = 0;
+
/* Initialize page counters explicitly (be tidy) */
- vacrel->scanned_pages = 0;
- vacrel->removed_pages = 0;
- vacrel->frozen_pages = 0;
- vacrel->lpdead_item_pages = 0;
- vacrel->missed_dead_pages = 0;
- vacrel->nonempty_pages = 0;
- /* dead_items_alloc allocates vacrel->dead_items later on */
+ counters = palloc(sizeof(LVRelCounters));
+ counters->scanned_pages = 0;
+ counters->removed_pages = 0;
+ counters->frozen_pages = 0;
+ counters->lpdead_item_pages = 0;
+ counters->missed_dead_pages = 0;
+ counters->nonempty_pages = 0;
+
+ /* Initialize remaining counters (be tidy) */
+ counters->tuples_deleted = 0;
+ counters->tuples_frozen = 0;
+ counters->lpdead_items = 0;
+ counters->live_tuples = 0;
+ counters->recently_dead_tuples = 0;
+ counters->missed_dead_tuples = 0;
+
+ vacrel->counters = counters;
+
+ vacrel->num_index_scans = 0;
/* Allocate/initialize output statistics state */
vacrel->new_rel_tuples = 0;
@@ -421,14 +523,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->indstats = (IndexBulkDeleteResult **)
palloc0(vacrel->nindexes * sizeof(IndexBulkDeleteResult *));
- /* Initialize remaining counters (be tidy) */
- vacrel->num_index_scans = 0;
- vacrel->tuples_deleted = 0;
- vacrel->tuples_frozen = 0;
- vacrel->lpdead_items = 0;
- vacrel->live_tuples = 0;
- vacrel->recently_dead_tuples = 0;
- vacrel->missed_dead_tuples = 0;
+ /* dead_items_alloc allocates vacrel->dead_items later on */
/*
* Get cutoffs that determine which deleted tuples are considered DEAD,
@@ -450,9 +545,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
vacrel->vistest = GlobalVisTestFor(rel);
/* Initialize state used to track oldest extant XID/MXID */
- vacrel->NewRelfrozenXid = vacrel->cutoffs.OldestXmin;
- vacrel->NewRelminMxid = vacrel->cutoffs.OldestMxact;
- vacrel->skippedallvis = false;
+ vacrel->counters->NewRelfrozenXid = vacrel->cutoffs.OldestXmin;
+ vacrel->counters->NewRelminMxid = vacrel->cutoffs.OldestMxact;
+ vacrel->counters->skippedallvis = false;
skipwithvm = true;
if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
{
@@ -533,15 +628,15 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* value >= FreezeLimit, and relminmxid to a value >= MultiXactCutoff.
* Non-aggressive VACUUMs may advance them by any amount, or not at all.
*/
- Assert(vacrel->NewRelfrozenXid == vacrel->cutoffs.OldestXmin ||
+ Assert(vacrel->counters->NewRelfrozenXid == vacrel->cutoffs.OldestXmin ||
TransactionIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.FreezeLimit :
vacrel->cutoffs.relfrozenxid,
- vacrel->NewRelfrozenXid));
- Assert(vacrel->NewRelminMxid == vacrel->cutoffs.OldestMxact ||
+ vacrel->counters->NewRelfrozenXid));
+ Assert(vacrel->counters->NewRelminMxid == vacrel->cutoffs.OldestMxact ||
MultiXactIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.MultiXactCutoff :
vacrel->cutoffs.relminmxid,
- vacrel->NewRelminMxid));
- if (vacrel->skippedallvis)
+ vacrel->counters->NewRelminMxid));
+ if (vacrel->counters->skippedallvis)
{
/*
* Must keep original relfrozenxid in a non-aggressive VACUUM that
@@ -549,8 +644,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* values will have missed unfrozen XIDs from the pages we skipped.
*/
Assert(!vacrel->aggressive);
- vacrel->NewRelfrozenXid = InvalidTransactionId;
- vacrel->NewRelminMxid = InvalidMultiXactId;
+ vacrel->counters->NewRelfrozenXid = InvalidTransactionId;
+ vacrel->counters->NewRelminMxid = InvalidMultiXactId;
}
/*
@@ -571,7 +666,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
*/
vac_update_relstats(rel, new_rel_pages, vacrel->new_live_tuples,
new_rel_allvisible, vacrel->nindexes > 0,
- vacrel->NewRelfrozenXid, vacrel->NewRelminMxid,
+ vacrel->counters->NewRelfrozenXid, vacrel->counters->NewRelminMxid,
&frozenxid_updated, &minmulti_updated, false);
/*
@@ -587,8 +682,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(rel),
rel->rd_rel->relisshared,
Max(vacrel->new_live_tuples, 0),
- vacrel->recently_dead_tuples +
- vacrel->missed_dead_tuples);
+ vacrel->counters->recently_dead_tuples +
+ vacrel->counters->missed_dead_tuples);
pgstat_progress_end_command();
if (instrument)
@@ -651,21 +746,21 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->relname,
vacrel->num_index_scans);
appendStringInfo(&buf, _("pages: %u removed, %u remain, %u scanned (%.2f%% of total)\n"),
- vacrel->removed_pages,
+ vacrel->counters->removed_pages,
new_rel_pages,
- vacrel->scanned_pages,
+ vacrel->counters->scanned_pages,
orig_rel_pages == 0 ? 100.0 :
- 100.0 * vacrel->scanned_pages / orig_rel_pages);
+ 100.0 * vacrel->counters->scanned_pages / orig_rel_pages);
appendStringInfo(&buf,
_("tuples: %lld removed, %lld remain, %lld are dead but not yet removable\n"),
- (long long) vacrel->tuples_deleted,
+ (long long) vacrel->counters->tuples_deleted,
(long long) vacrel->new_rel_tuples,
- (long long) vacrel->recently_dead_tuples);
- if (vacrel->missed_dead_tuples > 0)
+ (long long) vacrel->counters->recently_dead_tuples);
+ if (vacrel->counters->missed_dead_tuples > 0)
appendStringInfo(&buf,
_("tuples missed: %lld dead from %u pages not removed due to cleanup lock contention\n"),
- (long long) vacrel->missed_dead_tuples,
- vacrel->missed_dead_pages);
+ (long long) vacrel->counters->missed_dead_tuples,
+ vacrel->counters->missed_dead_pages);
diff = (int32) (ReadNextTransactionId() -
vacrel->cutoffs.OldestXmin);
appendStringInfo(&buf,
@@ -673,25 +768,25 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->cutoffs.OldestXmin, diff);
if (frozenxid_updated)
{
- diff = (int32) (vacrel->NewRelfrozenXid -
+ diff = (int32) (vacrel->counters->NewRelfrozenXid -
vacrel->cutoffs.relfrozenxid);
appendStringInfo(&buf,
_("new relfrozenxid: %u, which is %d XIDs ahead of previous value\n"),
- vacrel->NewRelfrozenXid, diff);
+ vacrel->counters->NewRelfrozenXid, diff);
}
if (minmulti_updated)
{
- diff = (int32) (vacrel->NewRelminMxid -
+ diff = (int32) (vacrel->counters->NewRelminMxid -
vacrel->cutoffs.relminmxid);
appendStringInfo(&buf,
_("new relminmxid: %u, which is %d MXIDs ahead of previous value\n"),
- vacrel->NewRelminMxid, diff);
+ vacrel->counters->NewRelminMxid, diff);
}
appendStringInfo(&buf, _("frozen: %u pages from table (%.2f%% of total) had %lld tuples frozen\n"),
- vacrel->frozen_pages,
+ vacrel->counters->frozen_pages,
orig_rel_pages == 0 ? 100.0 :
- 100.0 * vacrel->frozen_pages / orig_rel_pages,
- (long long) vacrel->tuples_frozen);
+ 100.0 * vacrel->counters->frozen_pages / orig_rel_pages,
+ (long long) vacrel->counters->tuples_frozen);
if (vacrel->do_index_vacuuming)
{
if (vacrel->nindexes == 0 || vacrel->num_index_scans == 0)
@@ -711,10 +806,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
msgfmt = _("%u pages from table (%.2f%% of total) have %lld dead item identifiers\n");
}
appendStringInfo(&buf, msgfmt,
- vacrel->lpdead_item_pages,
+ vacrel->counters->lpdead_item_pages,
orig_rel_pages == 0 ? 100.0 :
- 100.0 * vacrel->lpdead_item_pages / orig_rel_pages,
- (long long) vacrel->lpdead_items);
+ 100.0 * vacrel->counters->lpdead_item_pages / orig_rel_pages,
+ (long long) vacrel->counters->lpdead_items);
for (int i = 0; i < vacrel->nindexes; i++)
{
IndexBulkDeleteResult *istat = vacrel->indstats[i];
@@ -815,14 +910,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
static void
lazy_scan_heap(LVRelState *vacrel)
{
- BlockNumber rel_pages = vacrel->rel_pages,
- blkno,
- next_fsm_block_to_vacuum = 0;
- bool all_visible_according_to_vm;
-
- TidStore *dead_items = vacrel->dead_items;
+ BlockNumber rel_pages = vacrel->rel_pages;
VacDeadItemsInfo *dead_items_info = vacrel->dead_items_info;
- Buffer vmbuffer = InvalidBuffer;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -842,6 +931,70 @@ lazy_scan_heap(LVRelState *vacrel)
vacrel->next_unskippable_allvis = false;
vacrel->next_unskippable_vmbuffer = InvalidBuffer;
+ if (ParallelHeapVacuumIsActive(vacrel))
+ do_parallel_lazy_scan_heap(vacrel);
+ else
+ do_lazy_scan_heap(vacrel);
+
+ vacrel->blkno = InvalidBlockNumber;
+
+ /* report that everything is now scanned */
+ pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, rel_pages);
+
+ /* now we can compute the new value for pg_class.reltuples */
+ vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
+ vacrel->counters->scanned_pages,
+ vacrel->counters->live_tuples);
+
+ /*
+ * Also compute the total number of surviving heap entries. In the
+ * (unlikely) scenario that new_live_tuples is -1, take it as zero.
+ */
+ vacrel->new_rel_tuples =
+ Max(vacrel->new_live_tuples, 0) + vacrel->counters->recently_dead_tuples +
+ vacrel->counters->missed_dead_tuples;
+
+ /*
+ * Do index vacuuming (call each index's ambulkdelete routine), then do
+ * related heap vacuuming
+ */
+ if (dead_items_info->num_items > 0)
+ lazy_vacuum(vacrel);
+
+ /*
+ * Vacuum the remainder of the Free Space Map. We must do this whether or
+ * not there were indexes, and whether or not we bypassed index vacuuming.
+ */
+ if (rel_pages > vacrel->next_fsm_block_to_vacuum)
+ FreeSpaceMapVacuumRange(vacrel->rel, vacrel->next_fsm_block_to_vacuum,
+ rel_pages);
+
+ /* report all blocks vacuumed */
+ pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, rel_pages);
+
+ /* Do final index cleanup (call each index's amvacuumcleanup routine) */
+ if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
+ lazy_cleanup_all_indexes(vacrel);
+}
+
+/*
+ * Workhorse for lazy_scan_heap().
+ *
+ * Returns true if we processed all blocks; returns false if we exited before
+ * completing the heap scan because the space for dead item TIDs filled up. In
+ * the serial heap scan case, this function always returns true. In the parallel
+ * heap scan case, it is called by both the workers and the leader, and can return false.
+ */
+static bool
+do_lazy_scan_heap(LVRelState *vacrel)
+{
+ bool all_visible_according_to_vm;
+ TidStore *dead_items = vacrel->dead_items;
+ VacDeadItemsInfo *dead_items_info = vacrel->dead_items_info;
+ BlockNumber blkno;
+ Buffer vmbuffer = InvalidBuffer;
+ bool scan_done = true;
+
while (heap_vac_scan_next_block(vacrel, &blkno, &all_visible_according_to_vm))
{
Buffer buf;
@@ -849,7 +1002,7 @@ lazy_scan_heap(LVRelState *vacrel)
bool has_lpdead_items;
bool got_cleanup_lock = false;
- vacrel->scanned_pages++;
+ vacrel->counters->scanned_pages++;
/* Report as block scanned, update error traceback information */
pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
@@ -867,46 +1020,10 @@ lazy_scan_heap(LVRelState *vacrel)
* one-pass strategy, and the two-pass strategy with the index_cleanup
* param set to 'off'.
*/
- if (vacrel->scanned_pages % FAILSAFE_EVERY_PAGES == 0)
+ if (!IsParallelWorker() &&
+ vacrel->counters->scanned_pages % FAILSAFE_EVERY_PAGES == 0)
lazy_check_wraparound_failsafe(vacrel);
- /*
- * Consider if we definitely have enough space to process TIDs on page
- * already. If we are close to overrunning the available space for
- * dead_items TIDs, pause and do a cycle of vacuuming before we tackle
- * this page.
- */
- if (TidStoreMemoryUsage(dead_items) > dead_items_info->max_bytes)
- {
- /*
- * Before beginning index vacuuming, we release any pin we may
- * hold on the visibility map page. This isn't necessary for
- * correctness, but we do it anyway to avoid holding the pin
- * across a lengthy, unrelated operation.
- */
- if (BufferIsValid(vmbuffer))
- {
- ReleaseBuffer(vmbuffer);
- vmbuffer = InvalidBuffer;
- }
-
- /* Perform a round of index and heap vacuuming */
- vacrel->consider_bypass_optimization = false;
- lazy_vacuum(vacrel);
-
- /*
- * Vacuum the Free Space Map to make newly-freed space visible on
- * upper-level FSM pages. Note we have not yet processed blkno.
- */
- FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
- blkno);
- next_fsm_block_to_vacuum = blkno;
-
- /* Report that we are once again scanning the heap */
- pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
- PROGRESS_VACUUM_PHASE_SCAN_HEAP);
- }
-
/*
* Pin the visibility map page in case we need to mark the page
* all-visible. In most cases this will be very cheap, because we'll
@@ -994,10 +1111,14 @@ lazy_scan_heap(LVRelState *vacrel)
* also be no opportunity to update the FSM later, because we'll never
* revisit this page. Since updating the FSM is desirable but not
* absolutely required, that's OK.
+ *
+ * XXX: in parallel heap scan, some blocks before blkno might not have been
+ * processed yet. Is it worth vacuuming the FSM?
*/
- if (vacrel->nindexes == 0
- || !vacrel->do_index_vacuuming
- || !has_lpdead_items)
+ if (!IsParallelWorker() &&
+ (vacrel->nindexes == 0
+ || !vacrel->do_index_vacuuming
+ || !has_lpdead_items))
{
Size freespace = PageGetHeapFreeSpace(page);
@@ -1011,57 +1132,154 @@ lazy_scan_heap(LVRelState *vacrel)
* held the cleanup lock and lazy_scan_prune() was called.
*/
if (got_cleanup_lock && vacrel->nindexes == 0 && has_lpdead_items &&
- blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
+ blkno - vacrel->next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
{
- FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
+ FreeSpaceMapVacuumRange(vacrel->rel, vacrel->next_fsm_block_to_vacuum,
blkno);
- next_fsm_block_to_vacuum = blkno;
+ vacrel->next_fsm_block_to_vacuum = blkno;
}
}
else
UnlockReleaseBuffer(buf);
+
+ /*
+ * Consider if we definitely have enough space to process TIDs on page
+ * already. If we are close to overrunning the available space for
+ * dead_items TIDs, pause and do a cycle of vacuuming before we tackle
+ * this page.
+ */
+ if (TidStoreMemoryUsage(dead_items) > dead_items_info->max_bytes)
+ {
+ /*
+ * Before beginning index vacuuming, we release any pin we may
+ * hold on the visibility map page. This isn't necessary for
+ * correctness, but we do it anyway to avoid holding the pin
+ * across a lengthy, unrelated operation.
+ */
+ if (BufferIsValid(vmbuffer))
+ {
+ ReleaseBuffer(vmbuffer);
+ vmbuffer = InvalidBuffer;
+ }
+
+ if (ParallelHeapVacuumIsActive(vacrel))
+ {
+ /*
+ * In parallel heap vacuum case, both the leader process and
+ * the worker processes have to exit without invoking index
+ * and heap vacuuming. The leader process will wait for all
+ * workers to finish and perform index and heap vacuuming.
+ */
+ scan_done = false;
+ break;
+ }
+
+ /* Perform a round of index and heap vacuuming */
+ vacrel->consider_bypass_optimization = false;
+ lazy_vacuum(vacrel);
+
+ /*
+ * Vacuum the Free Space Map to make newly-freed space visible on
+ * upper-level FSM pages.
+ *
+ * XXX: in parallel heap scan, some blocks before blkno might not
+ * been processed yet. Is it worth vacuuming FSM?
+ */
+ FreeSpaceMapVacuumRange(vacrel->rel, vacrel->next_fsm_block_to_vacuum,
+ blkno + 1);
+ vacrel->next_fsm_block_to_vacuum = blkno;
+
+ /* Report that we are once again scanning the heap */
+ pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+ PROGRESS_VACUUM_PHASE_SCAN_HEAP);
+
+ continue;
+ }
}
- vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
ReleaseBuffer(vmbuffer);
- /* report that everything is now scanned */
- pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
+ return scan_done;
+}
- /* now we can compute the new value for pg_class.reltuples */
- vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
- vacrel->scanned_pages,
- vacrel->live_tuples);
+/*
+ * A parallel scan variant of heap_vac_scan_next_block.
+ *
+ * In parallel vacuum scan, we don't use the SKIP_PAGES_THRESHOLD optimization.
+ */
+static bool
+heap_vac_scan_next_block_parallel(LVRelState *vacrel, BlockNumber *blkno,
+ bool *all_visible_according_to_vm)
+{
+ PHVState *phvstate = vacrel->phvstate;
+ BlockNumber next_block;
+ Buffer vmbuffer = InvalidBuffer;
+ uint8 mapbits = 0;
- /*
- * Also compute the total number of surviving heap entries. In the
- * (unlikely) scenario that new_live_tuples is -1, take it as zero.
- */
- vacrel->new_rel_tuples =
- Max(vacrel->new_live_tuples, 0) + vacrel->recently_dead_tuples +
- vacrel->missed_dead_tuples;
+ Assert(ParallelHeapVacuumIsActive(vacrel));
- /*
- * Do index vacuuming (call each index's ambulkdelete routine), then do
- * related heap vacuuming
- */
- if (dead_items_info->num_items > 0)
- lazy_vacuum(vacrel);
+ for (;;)
+ {
+ next_block = table_block_parallelscan_nextpage(vacrel->rel,
+ &(phvstate->myscanstate->state),
+ phvstate->pscandesc);
- /*
- * Vacuum the remainder of the Free Space Map. We must do this whether or
- * not there were indexes, and whether or not we bypassed index vacuuming.
- */
- if (blkno > next_fsm_block_to_vacuum)
- FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno);
+ /* Have we reached the end of the table? */
+ if (!BlockNumberIsValid(next_block) || next_block >= vacrel->rel_pages)
+ {
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
- /* report all blocks vacuumed */
- pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
+ *blkno = vacrel->rel_pages;
+ return false;
+ }
- /* Do final index cleanup (call each index's amvacuumcleanup routine) */
- if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
- lazy_cleanup_all_indexes(vacrel);
+ /* We always treat the last block as unsafe to skip */
+ if (next_block == vacrel->rel_pages - 1)
+ break;
+
+ mapbits = visibilitymap_get_status(vacrel->rel, next_block, &vmbuffer);
+
+ /*
+ * A block is unskippable if it is not all visible according to the
+ * visibility map.
+ */
+ if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
+ {
+ Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
+ break;
+ }
+
+ /* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
+ if (!vacrel->skipwithvm)
+ break;
+
+ /*
+ * Aggressive VACUUM caller can't skip pages just because they are
+ * all-visible.
+ */
+ if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
+ {
+
+ if (vacrel->aggressive)
+ break;
+
+ /*
+ * All-visible block is safe to skip in non-aggressive case. But
+ * remember that the final range contains such a block for later.
+ */
+ vacrel->counters->skippedallvis = true;
+ }
+ }
+
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
+
+ *blkno = next_block;
+ *all_visible_according_to_vm = (mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0;
+
+ return true;
}
/*
@@ -1088,6 +1306,9 @@ heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
{
BlockNumber next_block;
+ if (ParallelHeapVacuumIsActive(vacrel))
+ return heap_vac_scan_next_block_parallel(vacrel, blkno, all_visible_according_to_vm);
+
/* relies on InvalidBlockNumber + 1 overflowing to 0 on first call */
next_block = vacrel->current_block + 1;
@@ -1137,7 +1358,7 @@ heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
{
next_block = vacrel->next_unskippable_block;
if (skipsallvis)
- vacrel->skippedallvis = true;
+ vacrel->counters->skippedallvis = true;
}
}
@@ -1210,11 +1431,12 @@ find_next_unskippable_block(LVRelState *vacrel, bool *skipsallvis)
/*
* Caller must scan the last page to determine whether it has tuples
- * (caller must have the opportunity to set vacrel->nonempty_pages).
- * This rule avoids having lazy_truncate_heap() take access-exclusive
- * lock on rel to attempt a truncation that fails anyway, just because
- * there are tuples on the last page (it is likely that there will be
- * tuples on other nearby pages as well, but those can be skipped).
+ * (caller must have the opportunity to set
+ * vacrel->counters->nonempty_pages). This rule avoids having
+ * lazy_truncate_heap() take access-exclusive lock on rel to attempt a
+ * truncation that fails anyway, just because there are tuples on the
+ * last page (it is likely that there will be tuples on other nearby
+ * pages as well, but those can be skipped).
*
* Implement this by always treating the last block as unsafe to skip.
*/
@@ -1439,10 +1661,10 @@ lazy_scan_prune(LVRelState *vacrel,
heap_page_prune_and_freeze(rel, buf, vacrel->vistest, prune_options,
&vacrel->cutoffs, &presult, PRUNE_VACUUM_SCAN,
&vacrel->offnum,
- &vacrel->NewRelfrozenXid, &vacrel->NewRelminMxid);
+ &vacrel->counters->NewRelfrozenXid, &vacrel->counters->NewRelminMxid);
- Assert(MultiXactIdIsValid(vacrel->NewRelminMxid));
+ Assert(MultiXactIdIsValid(vacrel->counters->NewRelminMxid));
- Assert(TransactionIdIsValid(vacrel->NewRelfrozenXid));
+ Assert(TransactionIdIsValid(vacrel->counters->NewRelfrozenXid));
if (presult.nfrozen > 0)
{
@@ -1451,7 +1673,7 @@ lazy_scan_prune(LVRelState *vacrel,
* nfrozen == 0, since it only counts pages with newly frozen tuples
* (don't confuse that with pages newly set all-frozen in VM).
*/
- vacrel->frozen_pages++;
+ vacrel->counters->frozen_pages++;
}
/*
@@ -1486,7 +1708,7 @@ lazy_scan_prune(LVRelState *vacrel,
*/
if (presult.lpdead_items > 0)
{
- vacrel->lpdead_item_pages++;
+ vacrel->counters->lpdead_item_pages++;
/*
* deadoffsets are collected incrementally in
@@ -1501,15 +1723,15 @@ lazy_scan_prune(LVRelState *vacrel,
}
/* Finally, add page-local counts to whole-VACUUM counts */
- vacrel->tuples_deleted += presult.ndeleted;
- vacrel->tuples_frozen += presult.nfrozen;
- vacrel->lpdead_items += presult.lpdead_items;
- vacrel->live_tuples += presult.live_tuples;
- vacrel->recently_dead_tuples += presult.recently_dead_tuples;
+ vacrel->counters->tuples_deleted += presult.ndeleted;
+ vacrel->counters->tuples_frozen += presult.nfrozen;
+ vacrel->counters->lpdead_items += presult.lpdead_items;
+ vacrel->counters->live_tuples += presult.live_tuples;
+ vacrel->counters->recently_dead_tuples += presult.recently_dead_tuples;
/* Can't truncate this page */
if (presult.hastup)
- vacrel->nonempty_pages = blkno + 1;
+ vacrel->counters->nonempty_pages = blkno + 1;
/* Did we find LP_DEAD items? */
*has_lpdead_items = (presult.lpdead_items > 0);
@@ -1659,8 +1881,8 @@ lazy_scan_noprune(LVRelState *vacrel,
missed_dead_tuples;
bool hastup;
HeapTupleHeader tupleheader;
- TransactionId NoFreezePageRelfrozenXid = vacrel->NewRelfrozenXid;
- MultiXactId NoFreezePageRelminMxid = vacrel->NewRelminMxid;
+ TransactionId NoFreezePageRelfrozenXid = vacrel->counters->NewRelfrozenXid;
+ MultiXactId NoFreezePageRelminMxid = vacrel->counters->NewRelminMxid;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1787,8 +2009,8 @@ lazy_scan_noprune(LVRelState *vacrel,
* this particular page until the next VACUUM. Remember its details now.
* (lazy_scan_prune expects a clean slate, so we have to do this last.)
*/
- vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
- vacrel->NewRelminMxid = NoFreezePageRelminMxid;
+ vacrel->counters->NewRelfrozenXid = NoFreezePageRelfrozenXid;
+ vacrel->counters->NewRelminMxid = NoFreezePageRelminMxid;
/* Save any LP_DEAD items found on the page in dead_items */
if (vacrel->nindexes == 0)
@@ -1815,25 +2037,25 @@ lazy_scan_noprune(LVRelState *vacrel,
* indexes will be deleted during index vacuuming (and then marked
* LP_UNUSED in the heap)
*/
- vacrel->lpdead_item_pages++;
+ vacrel->counters->lpdead_item_pages++;
dead_items_add(vacrel, blkno, deadoffsets, lpdead_items);
- vacrel->lpdead_items += lpdead_items;
+ vacrel->counters->lpdead_items += lpdead_items;
}
/*
* Finally, add relevant page-local counts to whole-VACUUM counts
*/
- vacrel->live_tuples += live_tuples;
- vacrel->recently_dead_tuples += recently_dead_tuples;
- vacrel->missed_dead_tuples += missed_dead_tuples;
+ vacrel->counters->live_tuples += live_tuples;
+ vacrel->counters->recently_dead_tuples += recently_dead_tuples;
+ vacrel->counters->missed_dead_tuples += missed_dead_tuples;
if (missed_dead_tuples > 0)
- vacrel->missed_dead_pages++;
+ vacrel->counters->missed_dead_pages++;
/* Can't truncate this page */
if (hastup)
- vacrel->nonempty_pages = blkno + 1;
+ vacrel->counters->nonempty_pages = blkno + 1;
/* Did we find LP_DEAD items? */
*has_lpdead_items = (lpdead_items > 0);
@@ -1862,7 +2084,7 @@ lazy_vacuum(LVRelState *vacrel)
/* Should not end up here with no indexes */
Assert(vacrel->nindexes > 0);
- Assert(vacrel->lpdead_item_pages > 0);
+ Assert(vacrel->counters->lpdead_item_pages > 0);
if (!vacrel->do_index_vacuuming)
{
@@ -1896,7 +2118,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items_info->num_items);
+ Assert(vacrel->counters->lpdead_items == vacrel->dead_items_info->num_items);
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -1923,7 +2145,7 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
+ bypass = (vacrel->counters->lpdead_item_pages < threshold &&
(TidStoreMemoryUsage(vacrel->dead_items) < (32L * 1024L * 1024L)));
}
@@ -2061,7 +2283,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items_info->num_items == vacrel->lpdead_items);
+ vacrel->dead_items_info->num_items == vacrel->counters->lpdead_items);
Assert(allindexes || VacuumFailsafeActive);
/*
@@ -2165,8 +2387,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* the second heap pass. No more, no less.
*/
Assert(vacrel->num_index_scans > 1 ||
- (vacrel->dead_items_info->num_items == vacrel->lpdead_items &&
- vacuumed_pages == vacrel->lpdead_item_pages));
+ (vacrel->dead_items_info->num_items == vacrel->counters->lpdead_items &&
+ vacuumed_pages == vacrel->counters->lpdead_item_pages));
ereport(DEBUG2,
(errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
@@ -2347,7 +2569,7 @@ static void
lazy_cleanup_all_indexes(LVRelState *vacrel)
{
double reltuples = vacrel->new_rel_tuples;
- bool estimated_count = vacrel->scanned_pages < vacrel->rel_pages;
+ bool estimated_count = vacrel->counters->scanned_pages < vacrel->rel_pages;
const int progress_start_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_INDEXES_TOTAL
@@ -2528,7 +2750,7 @@ should_attempt_truncation(LVRelState *vacrel)
if (!vacrel->do_rel_truncate || VacuumFailsafeActive)
return false;
- possibly_freeable = vacrel->rel_pages - vacrel->nonempty_pages;
+ possibly_freeable = vacrel->rel_pages - vacrel->counters->nonempty_pages;
if (possibly_freeable > 0 &&
(possibly_freeable >= REL_TRUNCATE_MINIMUM ||
possibly_freeable >= vacrel->rel_pages / REL_TRUNCATE_FRACTION))
@@ -2554,7 +2776,7 @@ lazy_truncate_heap(LVRelState *vacrel)
/* Update error traceback information one last time */
update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_TRUNCATE,
- vacrel->nonempty_pages, InvalidOffsetNumber);
+ vacrel->counters->nonempty_pages, InvalidOffsetNumber);
/*
* Loop until no more truncating can be done.
@@ -2655,7 +2877,7 @@ lazy_truncate_heap(LVRelState *vacrel)
* without also touching reltuples, since the tuple count wasn't
* changed by the truncation.
*/
- vacrel->removed_pages += orig_rel_pages - new_rel_pages;
+ vacrel->counters->removed_pages += orig_rel_pages - new_rel_pages;
vacrel->rel_pages = new_rel_pages;
ereport(vacrel->verbose ? INFO : DEBUG2,
@@ -2663,7 +2885,7 @@ lazy_truncate_heap(LVRelState *vacrel)
vacrel->relname,
orig_rel_pages, new_rel_pages)));
orig_rel_pages = new_rel_pages;
- } while (new_rel_pages > vacrel->nonempty_pages && lock_waiter_detected);
+ } while (new_rel_pages > vacrel->counters->nonempty_pages && lock_waiter_detected);
}
/*
@@ -2691,7 +2913,7 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
StaticAssertStmt((PREFETCH_SIZE & (PREFETCH_SIZE - 1)) == 0,
"prefetch size must be power of 2");
prefetchedUntil = InvalidBlockNumber;
- while (blkno > vacrel->nonempty_pages)
+ while (blkno > vacrel->counters->nonempty_pages)
{
Buffer buf;
Page page;
@@ -2803,7 +3025,7 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
* pages still are; we need not bother to look at the last known-nonempty
* page.
*/
- return vacrel->nonempty_pages;
+ return vacrel->counters->nonempty_pages;
}
/*
@@ -2821,12 +3043,8 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
autovacuum_work_mem != -1 ?
autovacuum_work_mem : maintenance_work_mem;
- /*
- * Initialize state for a parallel vacuum. As of now, only one worker can
- * be used for an index, so we invoke parallelism only if there are at
- * least two indexes on a table.
- */
- if (nworkers >= 0 && vacrel->nindexes > 1 && vacrel->do_index_vacuuming)
+ /* Initialize state for a parallel vacuum */
+ if (nworkers >= 0)
{
/*
* Since parallel workers cannot access data in temporary tables, we
@@ -2844,11 +3062,18 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
vacrel->relname)));
}
else
+ {
+ /*
+ * For parallel index vacuuming, only one worker can be used for
+ * an index, we invoke parallelism only if there are at least two
+ * indexes on a table.
+ */
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
vac_work_mem,
vacrel->verbose ? INFO : DEBUG2,
- vacrel->bstrategy);
+ vacrel->bstrategy, (void *) vacrel);
+ }
/*
* If parallel mode started, dead_items and dead_items_info spaces are
@@ -2889,9 +3114,19 @@ dead_items_add(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
};
int64 prog_val[2];
+ /*
+ * Protect both dead_items and dead_items_info from concurrent updates in
+ * parallel heap scan cases.
+ */
+ if (ParallelHeapVacuumIsActive(vacrel))
+ TidStoreLockExclusive(dead_items);
+
TidStoreSetBlockOffsets(dead_items, blkno, offsets, num_offsets);
vacrel->dead_items_info->num_items += num_offsets;
+ if (ParallelHeapVacuumIsActive(vacrel))
+ TidStoreUnlock(dead_items);
+
/* update the progress information */
prog_val[0] = vacrel->dead_items_info->num_items;
prog_val[1] = TidStoreMemoryUsage(dead_items);
@@ -3093,6 +3328,357 @@ update_relstats_all_indexes(LVRelState *vacrel)
}
}
+/*
+ * Compute the number of parallel workers for parallel vacuum heap scan.
+ *
+ * The calculation logic is borrowed from compute_parallel_worker().
+ */
+int
+heap_parallel_vacuum_compute_workers(Relation rel, int nrequested)
+{
+ int parallel_workers = 0;
+ int heap_parallel_threshold;
+ int heap_pages;
+
+ if (nrequested == 0)
+ {
+ /*
+ * Select the number of workers based on the log of the size of the
+ * relation. This probably needs to be a good deal more
+ * sophisticated, but we need something here for now. Note that the
+ * upper limit of the min_parallel_table_scan_size GUC is chosen to
+ * prevent overflow here.
+ */
+ heap_parallel_threshold = Max(min_parallel_table_scan_size, 1);
+ heap_pages = RelationGetNumberOfBlocks(rel);
+ while (heap_pages >= (BlockNumber) (heap_parallel_threshold * 3))
+ {
+ parallel_workers++;
+ heap_parallel_threshold *= 3;
+ if (heap_parallel_threshold > INT_MAX / 3)
+ break;
+ }
+ }
+ else
+ parallel_workers = nrequested;
+
+ return parallel_workers;
+}
+
+/*
+ * Compute the amount of space we'll need in the parallel heap vacuum
+ * DSM, and inform pcxt->estimator about our needs.
+ *
+ * nworkers is the number of workers for the table vacuum. Note that it could
+ * be different from pcxt->nworkers, since that is the maximum of the numbers
+ * of workers for table vacuum and index vacuum.
+ */
+void
+heap_parallel_vacuum_estimate(Relation rel, ParallelContext *pcxt,
+ int nworkers, void *state)
+{
+ Size size = 0;
+ LVRelState *vacrel = (LVRelState *) state;
+
+ /* space for PHVShared */
+ size = add_size(size, SizeOfPHVShared);
+ size = add_size(size, mul_size(sizeof(LVRelCounters), nworkers));
+ vacrel->shared_len = size;
+ shm_toc_estimate_chunk(&pcxt->estimator, size);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+ /* space for ParallelBlockTableScanDesc */
+ vacrel->pscan_len = table_block_parallelscan_estimate(rel);
+ shm_toc_estimate_chunk(&pcxt->estimator, vacrel->pscan_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+ /* space for per-worker scan state, PHVScanWorkerState */
+ vacrel->pscanwork_len = mul_size(sizeof(PHVScanWorkerState), nworkers);
+ shm_toc_estimate_chunk(&pcxt->estimator, vacrel->pscanwork_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/*
+ * Set up shared memory for parallel heap vacuum.
+ */
+void
+heap_parallel_vacuum_initialize(Relation rel, ParallelContext *pcxt,
+ int nworkers, void *state)
+{
+ LVRelState *vacrel = (LVRelState *) state;
+ ParallelBlockTableScanDesc pscan;
+ PHVScanWorkerState *pscanwork;
+ PHVShared *shared;
+ PHVState *phvstate;
+
+ phvstate = (PHVState *) palloc(sizeof(PHVState));
+
+ shared = shm_toc_allocate(pcxt->toc, vacrel->shared_len);
+
+ /* Prepare the shared information */
+
+ MemSet(shared, 0, vacrel->shared_len);
+ shared->aggressive = vacrel->aggressive;
+ shared->skipwithvm = vacrel->skipwithvm;
+ shared->cutoffs = vacrel->cutoffs;
+ shared->NewRelfrozenXid = vacrel->counters->NewRelfrozenXid;
+ shared->NewRelminMxid = vacrel->counters->NewRelminMxid;
+ shared->skippedallvis = vacrel->counters->skippedallvis;
+
+ /*
+ * XXX: we copy the contents of vistest to the shared area, but in order
+ * to do that, we need to either expose GlobalVisState or provide
+ * functions to copy the contents of GlobalVisState somewhere. Currently we
+ * do the former, but it is not clear that this is the best choice.
+ *
+ * An alternative idea is to have each worker determine the cutoff and have
+ * its own vistest. But we need to consider this carefully, since parallel
+ * workers would end up having different cutoffs and horizons.
+ */
+ shared->vistest = *vacrel->vistest;
+
+ shm_toc_insert(pcxt->toc, LV_PARALLEL_SCAN_SHARED, shared);
+
+ phvstate->shared = shared;
+
+ /* prepare the parallel block table scan description */
+ pscan = shm_toc_allocate(pcxt->toc, vacrel->pscan_len);
+ shm_toc_insert(pcxt->toc, LV_PARALLEL_SCAN_DESC, pscan);
+
+ /* initialize parallel scan description */
+ table_block_parallelscan_initialize(rel, (ParallelTableScanDesc) pscan);
+ phvstate->pscandesc = pscan;
+
+ /* prepare the workers' parallel block table scan state */
+ pscanwork = shm_toc_allocate(pcxt->toc, vacrel->pscanwork_len);
+ MemSet(pscanwork, 0, vacrel->pscanwork_len);
+ shm_toc_insert(pcxt->toc, LV_PARALLEL_SCAN_DESC_WORKER, pscanwork);
+ phvstate->scanstates = pscanwork;
+
+ vacrel->phvstate = phvstate;
+}
+
+/*
+ * Main function for parallel heap vacuum workers.
+ */
+void
+heap_parallel_vacuum_scan_worker(Relation rel, ParallelVacuumState *pvs,
+ ParallelWorkerContext *pwcxt)
+{
+ LVRelState vacrel = {0};
+ PHVState *phvstate;
+ PHVShared *shared;
+ ParallelBlockTableScanDesc pscandesc;
+ PHVScanWorkerState *scanstate;
+ LVRelCounters *counters;
+ bool scan_done;
+
+ phvstate = palloc(sizeof(PHVState));
+
+ pscandesc = (ParallelBlockTableScanDesc) shm_toc_lookup(pwcxt->toc,
+ LV_PARALLEL_SCAN_DESC,
+ false);
+ phvstate->pscandesc = pscandesc;
+
+ shared = (PHVShared *) shm_toc_lookup(pwcxt->toc, LV_PARALLEL_SCAN_SHARED,
+ false);
+ phvstate->shared = shared;
+
+ scanstate = (PHVScanWorkerState *) shm_toc_lookup(pwcxt->toc,
+ LV_PARALLEL_SCAN_DESC_WORKER,
+ false);
+
+ phvstate->myscanstate = &(scanstate[ParallelWorkerNumber]);
+ counters = &(shared->worker_relcnts[ParallelWorkerNumber]);
+
+ /* Prepare LVRelState */
+ vacrel.rel = rel;
+ vacrel.indrels = parallel_vacuum_get_table_indexes(pvs, &vacrel.nindexes);
+ vacrel.pvs = pvs;
+ vacrel.phvstate = phvstate;
+ vacrel.aggressive = shared->aggressive;
+ vacrel.skipwithvm = shared->skipwithvm;
+ vacrel.cutoffs = shared->cutoffs;
+ vacrel.vistest = &(shared->vistest);
+ vacrel.dead_items = parallel_vacuum_get_dead_items(pvs,
+ &vacrel.dead_items_info);
+ vacrel.rel_pages = RelationGetNumberOfBlocks(rel);
+ vacrel.counters = counters;
+
+ /* initialize per-worker relation statistics */
+ MemSet(counters, 0, sizeof(LVRelCounters));
+
+ vacrel.counters->NewRelfrozenXid = shared->NewRelfrozenXid;
+ vacrel.counters->NewRelminMxid = shared->NewRelminMxid;
+ vacrel.counters->skippedallvis = shared->skippedallvis;
+
+ /*
+ * XXX: the following fields are not set yet:
+ * - index vacuum related fields such as consider_bypass_optimization,
+ *   do_index_vacuuming, etc.
+ * - error reporting state
+ * - statistics such as scanned_pages etc.
+ * - oldest extant XID/MXID
+ * - states maintained by heap_vac_scan_next_block()
+ */
+
+ /* Initialize the start block if not yet done */
+ if (!phvstate->myscanstate->maybe_have_blocks)
+ {
+ table_block_parallelscan_startblock_init(rel,
+ &(phvstate->myscanstate->state),
+ phvstate->pscandesc);
+
+ phvstate->myscanstate->maybe_have_blocks = false;
+ }
+
+ /*
+ * XXX: if we want to support parallel heap *vacuum*, we need to allow
+ * workers to call different function based on the shared information.
+ */
+ scan_done = do_lazy_scan_heap(&vacrel);
+
+ phvstate->myscanstate->maybe_have_blocks = !scan_done;
+}
+
+/*
+ * Complete parallel heap scans that have remaining blocks in their
+ * chunks.
+ */
+static void
+parallel_heap_complete_unfinished_scan(LVRelState *vacrel)
+{
+ int nworkers;
+
+ Assert(!IsParallelWorker());
+
+ nworkers = parallel_vacuum_get_nworkers_table(vacrel->pvs);
+
+ for (int i = 0; i < nworkers; i++)
+ {
+ PHVScanWorkerState *wstate = &(vacrel->phvstate->scanstates[i]);
+ bool scan_done PG_USED_FOR_ASSERTS_ONLY;
+
+ if (!wstate->maybe_have_blocks)
+ continue;
+
+ vacrel->phvstate->myscanstate = wstate;
+
+ scan_done = do_lazy_scan_heap(vacrel);
+
+ Assert(scan_done);
+ }
+}
+
+/*
+ * Accumulate relation counters that parallel workers collected into the
+ * leader's counters.
+ */
+static void
+parallel_heap_vacuum_gather_scan_stats(LVRelState *vacrel)
+{
+ PHVState *phvstate = vacrel->phvstate;
+
+ Assert(ParallelHeapVacuumIsActive(vacrel));
+
+ for (int i = 0; i < phvstate->nworkers_launched; i++)
+ {
+ LVRelCounters *counters = &(phvstate->shared->worker_relcnts[i]);
+
+#define LV_ACCUM_ITEM(item) (vacrel)->counters->item += (counters)->item
+
+ LV_ACCUM_ITEM(scanned_pages);
+ LV_ACCUM_ITEM(removed_pages);
+ LV_ACCUM_ITEM(frozen_pages);
+ LV_ACCUM_ITEM(lpdead_item_pages);
+ LV_ACCUM_ITEM(missed_dead_pages);
+ LV_ACCUM_ITEM(nonempty_pages);
+ LV_ACCUM_ITEM(tuples_deleted);
+ LV_ACCUM_ITEM(tuples_frozen);
+ LV_ACCUM_ITEM(lpdead_items);
+ LV_ACCUM_ITEM(live_tuples);
+ LV_ACCUM_ITEM(recently_dead_tuples);
+ LV_ACCUM_ITEM(missed_dead_tuples);
+
+#undef LV_ACCUM_ITEM
+
+ if (TransactionIdPrecedes(counters->NewRelfrozenXid, vacrel->counters->NewRelfrozenXid))
+ vacrel->counters->NewRelfrozenXid = counters->NewRelfrozenXid;
+
+ if (MultiXactIdPrecedesOrEquals(counters->NewRelminMxid, vacrel->counters->NewRelminMxid))
+ vacrel->counters->NewRelminMxid = counters->NewRelminMxid;
+
+ if (!vacrel->counters->skippedallvis && counters->skippedallvis)
+ vacrel->counters->skippedallvis = true;
+ }
+}
+
+/*
+ * A parallel variant of do_lazy_scan_heap(). The leader process launches parallel
+ * workers to scan the heap in parallel.
+ */
+static void
+do_parallel_lazy_scan_heap(LVRelState *vacrel)
+{
+ PHVScanWorkerState *scanstate;
+ TidStore *dead_items = vacrel->dead_items;
+ VacDeadItemsInfo *dead_items_info = vacrel->dead_items_info;
+
+ Assert(ParallelHeapVacuumIsActive(vacrel));
+ Assert(!IsParallelWorker());
+
+ /* launch parallel workers */
+ vacrel->phvstate->nworkers_launched = parallel_vacuum_table_scan_begin(vacrel->pvs);
+
+ /* initialize the leader's scan state to join the scan as a worker */
+ scanstate = palloc(sizeof(PHVScanWorkerState));
+ table_block_parallelscan_startblock_init(vacrel->rel, &(scanstate->state),
+ vacrel->phvstate->pscandesc);
+ vacrel->phvstate->myscanstate = scanstate;
+
+ for (;;)
+ {
+ bool scan_done PG_USED_FOR_ASSERTS_ONLY;
+
+ /*
+ * Scan the table until either we are close to overrunning the
+ * available space for dead_items TIDs or we reach the end of the
+ * table.
+ */
+ scan_done = do_lazy_scan_heap(vacrel);
+
+ /* stop parallel workers and gather the collected stats */
+ parallel_vacuum_table_scan_end(vacrel->pvs);
+ parallel_heap_vacuum_gather_scan_stats(vacrel);
+
+ /*
+ * If we are close to overrunning the available space for dead_items
+ * TIDs, do a round of index and heap vacuuming before resuming the
+ * parallel scan.
+ */
+ if (TidStoreMemoryUsage(dead_items) > dead_items_info->max_bytes)
+ {
+ /* Perform a round of index and heap vacuuming */
+ vacrel->consider_bypass_optimization = false;
+ lazy_vacuum(vacrel);
+
+ /* Report that we are once again scanning the heap */
+ pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+ PROGRESS_VACUUM_PHASE_SCAN_HEAP);
+
+ /* re-launch parallel workers */
+ vacrel->phvstate->nworkers_launched =
+ parallel_vacuum_table_scan_begin(vacrel->pvs);
+
+ continue;
+ }
+
+ /* We reached the end of the table */
+ Assert(scan_done);
+ break;
+ }
+
+ parallel_heap_complete_unfinished_scan(vacrel);
+}
+
/*
* Error context callback for errors occurring during vacuum. The error
* context messages for index phases should match the messages set in parallel
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index f26070bff2..e1759da69a 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -28,6 +28,7 @@
#include "access/amapi.h"
#include "access/table.h"
+#include "access/tableam.h"
#include "access/xact.h"
#include "commands/progress.h"
#include "commands/vacuum.h"
@@ -64,6 +65,12 @@ typedef struct PVShared
Oid relid;
int elevel;
+ /*
+ * True if the caller wants parallel workers to invoke the vacuum table
+ * scan callback.
+ */
+ bool do_vacuum_table_scan;
+
/*
* Fields for both index vacuum and cleanup.
*
@@ -163,6 +170,9 @@ struct ParallelVacuumState
/* NULL for worker processes */
ParallelContext *pcxt;
+ /* Passed to parallel table scan workers. NULL for leader process */
+ ParallelWorkerContext *pwcxt;
+
/* Parent Heap Relation */
Relation heaprel;
@@ -192,6 +202,16 @@ struct ParallelVacuumState
/* Points to WAL usage area in DSM */
WalUsage *wal_usage;
+ /*
+ * The number of workers for parallel table scan/vacuuming and index
+ * vacuuming, respectively.
+ */
+ int nworkers_for_table;
+ int nworkers_for_index;
+
+ /* How many parallel table vacuum scans have been performed? */
+ int num_table_scans;
+
/*
* False if the index is totally unsuitable target for all parallel
* processing. For example, the index could be <
@@ -220,8 +240,9 @@ struct ParallelVacuumState
PVIndVacStatus status;
};
-static int parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested,
- bool *will_parallel_vacuum);
+static void parallel_vacuum_compute_workers(Relation rel, Relation *indrels, int nindexes,
+ int nrequested, int *nworkers_table,
+ int *nworkers_index, bool *will_parallel_vacuum);
static void parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scans,
bool vacuum);
static void parallel_vacuum_process_safe_indexes(ParallelVacuumState *pvs);
@@ -241,7 +262,7 @@ static void parallel_vacuum_error_callback(void *arg);
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
int nrequested_workers, int vac_work_mem,
- int elevel, BufferAccessStrategy bstrategy)
+ int elevel, BufferAccessStrategy bstrategy, void *state)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
@@ -255,6 +276,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
Size est_shared_len;
int nindexes_mwm = 0;
int parallel_workers = 0;
+ int nworkers_table;
+ int nworkers_index;
int querylen;
/*
@@ -262,15 +285,17 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
* relation
*/
Assert(nrequested_workers >= 0);
- Assert(nindexes > 0);
/*
* Compute the number of parallel vacuum workers to launch
*/
will_parallel_vacuum = (bool *) palloc0(sizeof(bool) * nindexes);
- parallel_workers = parallel_vacuum_compute_workers(indrels, nindexes,
- nrequested_workers,
- will_parallel_vacuum);
+ parallel_vacuum_compute_workers(rel, indrels, nindexes, nrequested_workers,
+ &nworkers_table, &nworkers_index,
+ will_parallel_vacuum);
+
+ parallel_workers = Max(nworkers_table, nworkers_index);
+
if (parallel_workers <= 0)
{
/* Can't perform vacuum in parallel -- return NULL */
@@ -284,6 +309,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
pvs->will_parallel_vacuum = will_parallel_vacuum;
pvs->bstrategy = bstrategy;
pvs->heaprel = rel;
+ pvs->nworkers_for_table = nworkers_table;
+ pvs->nworkers_for_index = nworkers_index;
EnterParallelMode();
pcxt = CreateParallelContext("postgres", "parallel_vacuum_main",
@@ -326,6 +353,10 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
else
querylen = 0; /* keep compiler quiet */
+ /* Estimate AM-specific space for parallel table vacuum */
+ if (nworkers_table > 0)
+ table_parallel_vacuum_estimate(rel, pcxt, nworkers_table, state);
+
InitializeParallelDSM(pcxt);
/* Prepare index vacuum stats */
@@ -417,6 +448,10 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
PARALLEL_VACUUM_KEY_QUERY_TEXT, sharedquery);
}
+ /* Prepare AM-specific DSM for parallel table vacuum */
+ if (nworkers_table > 0)
+ table_parallel_vacuum_initialize(rel, pcxt, nworkers_table, state);
+
/* Success -- return parallel vacuum state */
return pvs;
}
@@ -538,27 +573,41 @@ parallel_vacuum_cleanup_all_indexes(ParallelVacuumState *pvs, long num_table_tup
* min_parallel_index_scan_size as invoking workers for very small indexes
* can hurt performance.
*
+ * XXX needs to mention the number of workers for the table scan.
+ *
* nrequested is the number of parallel workers that user requested. If
* nrequested is 0, we compute the parallel degree based on nindexes, that is
* the number of indexes that support parallel vacuum. This function also
* sets will_parallel_vacuum to remember indexes that participate in parallel
* vacuum.
*/
-static int
-parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested,
- bool *will_parallel_vacuum)
+static void
+parallel_vacuum_compute_workers(Relation rel, Relation *indrels, int nindexes,
+ int nrequested, int *nworkers_table,
+ int *nworkers_index, bool *will_parallel_vacuum)
{
int nindexes_parallel = 0;
int nindexes_parallel_bulkdel = 0;
int nindexes_parallel_cleanup = 0;
- int parallel_workers;
+ int parallel_workers_table = 0;
+ int parallel_workers_index = 0;
+
+ *nworkers_table = 0;
+ *nworkers_index = 0;
/*
* We don't allow performing parallel operation in standalone backend or
* when parallelism is disabled.
*/
if (!IsUnderPostmaster || max_parallel_maintenance_workers == 0)
- return 0;
+ return;
+
+ /*
+ * Compute the number of workers for parallel table scan. Cap by
+ * max_parallel_maintenance_workers.
+ */
+ parallel_workers_table = Min(table_parallel_vacuum_compute_workers(rel, nrequested),
+ max_parallel_maintenance_workers);
/*
* Compute the number of indexes that can participate in parallel vacuum.
@@ -589,17 +638,18 @@ parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested,
nindexes_parallel--;
/* No index supports parallel vacuum */
- if (nindexes_parallel <= 0)
- return 0;
-
- /* Compute the parallel degree */
- parallel_workers = (nrequested > 0) ?
- Min(nrequested, nindexes_parallel) : nindexes_parallel;
+ if (nindexes_parallel > 0)
+ {
+ /* Compute the parallel degree for parallel index vacuum */
+ parallel_workers_index = (nrequested > 0) ?
+ Min(nrequested, nindexes_parallel) : nindexes_parallel;
- /* Cap by max_parallel_maintenance_workers */
- parallel_workers = Min(parallel_workers, max_parallel_maintenance_workers);
+ /* Cap by max_parallel_maintenance_workers */
+ parallel_workers_index = Min(parallel_workers_index, max_parallel_maintenance_workers);
+ }
- return parallel_workers;
+ *nworkers_table = parallel_workers_table;
+ *nworkers_index = parallel_workers_index;
}
/*
@@ -669,7 +719,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
if (nworkers > 0)
{
/* Reinitialize parallel context to relaunch parallel workers */
- if (num_index_scans > 0)
+ if (num_index_scans > 0 || pvs->num_table_scans > 0)
ReinitializeParallelDSM(pvs->pcxt);
/*
@@ -978,6 +1028,120 @@ parallel_vacuum_index_is_parallel_safe(Relation indrel, int num_index_scans,
return true;
}
+/*
+ * A parallel worker invokes the table-AM-specified vacuum scan callback.
+ */
+static void
+parallel_vacuum_process_table(ParallelVacuumState *pvs)
+{
+ /*
+ * Increment the active worker count if we are able to launch any worker.
+ */
+ if (VacuumActiveNWorkers)
+ pg_atomic_add_fetch_u32(VacuumActiveNWorkers, 1);
+
+ /* Do table vacuum scan */
+ table_parallel_vacuum_scan(pvs->heaprel, pvs, pvs->pwcxt);
+
+ /*
+ * We have completed the table vacuum so decrement the active worker
+ * count.
+ */
+ if (VacuumActiveNWorkers)
+ pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1);
+}
+
+/*
+ * Prepare DSM and vacuum delay, and launch parallel workers for parallel
+ * table vacuum scan.
+ */
+int
+parallel_vacuum_table_scan_begin(ParallelVacuumState *pvs)
+{
+ Assert(!IsParallelWorker());
+
+ if (pvs->nworkers_for_table == 0)
+ return 0;
+
+ pg_atomic_write_u32(&(pvs->shared->cost_balance), VacuumCostBalance);
+ pg_atomic_write_u32(&(pvs->shared->active_nworkers), 0);
+
+ pvs->shared->do_vacuum_table_scan = true;
+
+ if (pvs->num_table_scans > 0)
+ ReinitializeParallelDSM(pvs->pcxt);
+
+ ReinitializeParallelWorkers(pvs->pcxt, pvs->nworkers_for_table);
+
+ LaunchParallelWorkers(pvs->pcxt);
+
+ if (pvs->pcxt->nworkers_launched > 0)
+ {
+ /*
+ * Reset the local cost values for leader backend as we have already
+ * accumulated the remaining balance of heap.
+ */
+ VacuumCostBalance = 0;
+ VacuumCostBalanceLocal = 0;
+
+ /* Enable shared cost balance for leader backend */
+ VacuumSharedCostBalance = &(pvs->shared->cost_balance);
+ VacuumActiveNWorkers = &(pvs->shared->active_nworkers);
+ }
+
+ ereport(pvs->shared->elevel,
+ (errmsg(ngettext("launched %d parallel vacuum worker for table scanning (planned: %d)",
+ "launched %d parallel vacuum workers for table scanning (planned: %d)",
+ pvs->pcxt->nworkers_launched),
+ pvs->pcxt->nworkers_launched, pvs->nworkers_for_table)));
+
+ return pvs->pcxt->nworkers_launched;
+}
+
+/*
+ * Wait for all workers for parallel table vacuum scan, and gather statistics.
+ */
+void
+parallel_vacuum_table_scan_end(ParallelVacuumState *pvs)
+{
+ Assert(!IsParallelWorker());
+
+ if (pvs->nworkers_for_table == 0)
+ return;
+
+ WaitForParallelWorkersToFinish(pvs->pcxt);
+
+ for (int i = 0; i < pvs->pcxt->nworkers_launched; i++)
+ InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]);
+
+ /*
+ * Carry the shared balance value to heap scan and disable shared costing
+ */
+ if (VacuumSharedCostBalance)
+ {
+ VacuumCostBalance = pg_atomic_read_u32(VacuumSharedCostBalance);
+ VacuumSharedCostBalance = NULL;
+ VacuumActiveNWorkers = NULL;
+ }
+
+ pvs->shared->do_vacuum_table_scan = false;
+ pvs->num_table_scans++;
+}
+
+Relation *
+parallel_vacuum_get_table_indexes(ParallelVacuumState *pvs, int *nindexes)
+{
+ *nindexes = pvs->nindexes;
+
+ return pvs->indrels;
+}
+
+int
+parallel_vacuum_get_nworkers_table(ParallelVacuumState *pvs)
+{
+ return pvs->nworkers_for_table;
+}
+
/*
* Perform work within a launched parallel process.
*
@@ -1026,7 +1190,6 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
* matched to the leader's one.
*/
vac_open_indexes(rel, RowExclusiveLock, &nindexes, &indrels);
- Assert(nindexes > 0);
if (shared->maintenance_work_mem_worker > 0)
maintenance_work_mem = shared->maintenance_work_mem_worker;
@@ -1060,6 +1223,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
pvs.relname = pstrdup(RelationGetRelationName(rel));
pvs.heaprel = rel;
+ pvs.pwcxt = palloc(sizeof(ParallelWorkerContext));
+ pvs.pwcxt->toc = toc;
+ pvs.pwcxt->seg = seg;
+
/* These fields will be filled during index vacuum or cleanup */
pvs.indname = NULL;
pvs.status = PARALLEL_INDVAC_STATUS_INITIAL;
@@ -1077,8 +1244,15 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Prepare to track buffer usage during parallel execution */
InstrStartParallelQuery();
- /* Process indexes to perform vacuum/cleanup */
- parallel_vacuum_process_safe_indexes(&pvs);
+ if (pvs.shared->do_vacuum_table_scan)
+ {
+ parallel_vacuum_process_table(&pvs);
+ }
+ else
+ {
+ /* Process indexes to perform vacuum/cleanup */
+ parallel_vacuum_process_safe_indexes(&pvs);
+ }
/* Report buffer/WAL usage during parallel execution */
buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index af3b15e93d..63c2548c54 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -164,15 +164,6 @@ typedef struct ProcArrayStruct
*
* The typedef is in the header.
*/
-struct GlobalVisState
-{
- /* XIDs >= are considered running by some backend */
- FullTransactionId definitely_needed;
-
- /* XIDs < are not considered to be running by any backend */
- FullTransactionId maybe_needed;
-};
-
/*
* Result of ComputeXidHorizons().
*/
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 9e9aec88a6..a80b3a17bf 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -20,6 +20,7 @@
#include "access/skey.h"
#include "access/table.h" /* for backward compatibility */
#include "access/tableam.h"
+#include "commands/vacuum.h"
#include "nodes/lockoptions.h"
#include "nodes/primnodes.h"
#include "storage/bufpage.h"
@@ -393,6 +394,13 @@ extern void log_heap_prune_and_freeze(Relation relation, Buffer buffer,
struct VacuumParams;
extern void heap_vacuum_rel(Relation rel,
struct VacuumParams *params, BufferAccessStrategy bstrategy);
+extern int heap_parallel_vacuum_compute_workers(Relation rel, int requested);
+extern void heap_parallel_vacuum_estimate(Relation rel, ParallelContext *pcxt,
+ int nworkers, void *state);
+extern void heap_parallel_vacuum_initialize(Relation rel, ParallelContext *pcxt,
+ int nworkers, void *state);
+extern void heap_parallel_vacuum_scan_worker(Relation rel, ParallelVacuumState *pvs,
+ ParallelWorkerContext *pwcxt);
/* in heap/heapam_visibility.c */
extern bool HeapTupleSatisfiesVisibility(HeapTuple htup, Snapshot snapshot,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index da661289c1..e1bacc95cd 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -20,6 +20,7 @@
#include "access/relscan.h"
#include "access/sdir.h"
#include "access/xact.h"
+#include "commands/vacuum.h"
#include "executor/tuptable.h"
#include "storage/read_stream.h"
#include "utils/rel.h"
@@ -655,6 +656,46 @@ typedef struct TableAmRoutine
struct VacuumParams *params,
BufferAccessStrategy bstrategy);
+ /* ------------------------------------------------------------------------
+ * Callbacks for parallel table vacuum.
+ * ------------------------------------------------------------------------
+ */
+
+ /*
+ * Compute the number of parallel workers for parallel table vacuum. The
+ * function must return 0 to disable parallel table vacuum.
+ */
+ int (*parallel_vacuum_compute_workers) (Relation rel, int requested);
+
+ /*
+ * Compute the amount of DSM space the AM needs for parallel table vacuum.
+ *
+ * Not called if parallel table vacuum is disabled.
+ */
+ void (*parallel_vacuum_estimate) (Relation rel,
+ ParallelContext *pcxt,
+ int nworkers,
+ void *state);
+
+ /*
+ * Initialize DSM space for parallel table vacuum.
+ *
+ * Not called if parallel table vacuum is disabled.
+ */
+ void (*parallel_vacuum_initialize) (Relation rel,
+ ParallelContext *pctx,
+ int nworkers,
+ void *state);
+
+ /*
+ * This callback is called for parallel table vacuum workers.
+ *
+ * Not called if parallel table vacuum is disabled.
+ */
+ void (*parallel_vacuum_scan_worker) (Relation rel,
+ ParallelVacuumState *pvs,
+ ParallelWorkerContext *pwcxt);
+
/*
* Prepare to analyze block `blockno` of `scan`. The scan has been started
* with table_beginscan_analyze(). See also
@@ -1710,6 +1751,33 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
rel->rd_tableam->relation_vacuum(rel, params, bstrategy);
}
+static inline int
+table_parallel_vacuum_compute_workers(Relation rel, int requested)
+{
+ return rel->rd_tableam->parallel_vacuum_compute_workers(rel, requested);
+}
+
+static inline void
+table_parallel_vacuum_estimate(Relation rel, ParallelContext *pcxt, int nworkers,
+ void *state)
+{
+ rel->rd_tableam->parallel_vacuum_estimate(rel, pcxt, nworkers, state);
+}
+
+static inline void
+table_parallel_vacuum_initialize(Relation rel, ParallelContext *pcxt, int nworkers,
+ void *state)
+{
+ rel->rd_tableam->parallel_vacuum_initialize(rel, pcxt, nworkers, state);
+}
+
+static inline void
+table_parallel_vacuum_scan(Relation rel, ParallelVacuumState *pvs,
+ ParallelWorkerContext *pwcxt)
+{
+ rel->rd_tableam->parallel_vacuum_scan_worker(rel, pvs, pwcxt);
+}
+
/*
* Prepare to analyze the next block in the read stream. The scan needs to
* have been started with table_beginscan_analyze(). Note that this routine
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 759f9a87d3..e665335b6f 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -360,7 +360,8 @@ extern void VacuumUpdateCosts(void);
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
int vac_work_mem, int elevel,
- BufferAccessStrategy bstrategy);
+ BufferAccessStrategy bstrategy,
+ void *state);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs,
VacDeadItemsInfo **dead_items_info_p);
@@ -372,6 +373,10 @@ extern void parallel_vacuum_cleanup_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans,
bool estimated_count);
+extern int parallel_vacuum_table_scan_begin(ParallelVacuumState *pvs);
+extern void parallel_vacuum_table_scan_end(ParallelVacuumState *pvs);
+extern int parallel_vacuum_get_nworkers_table(ParallelVacuumState *pvs);
+extern Relation *parallel_vacuum_get_table_indexes(ParallelVacuumState *pvs, int *nindexes);
extern void parallel_vacuum_main(dsm_segment *seg, shm_toc *toc);
/* in commands/analyze.c */
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 9398a84051..6ccb19a29f 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -102,8 +102,20 @@ extern char *ExportSnapshot(Snapshot snapshot);
/*
* These live in procarray.c because they're intimately linked to the
* procarray contents, but thematically they better fit into snapmgr.h.
+ *
+ * XXX the struct definition is temporarily moved from procarray.c for
+ * parallel table vacuum development. We need to find a suitable way for
+ * parallel table vacuum workers to share the GlobalVisState.
*/
-typedef struct GlobalVisState GlobalVisState;
+typedef struct GlobalVisState
+{
+ /* XIDs >= are considered running by some backend */
+ FullTransactionId definitely_needed;
+
+ /* XIDs < are not considered to be running by any backend */
+ FullTransactionId maybe_needed;
+} GlobalVisState;
+
extern GlobalVisState *GlobalVisTestFor(Relation rel);
extern bool GlobalVisTestIsRemovableXid(GlobalVisState *state, TransactionId xid);
extern bool GlobalVisTestIsRemovableFullXid(GlobalVisState *state, FullTransactionId fxid);
On Fri, Jun 28, 2024 at 9:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Jun 28, 2024 at 9:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
# Benchmark results
* Test-1: parallel heap scan on the table without indexes
I created 20GB table, made garbage on the table, and run vacuum while
changing parallel degree:

create unlogged table test (a int) with (autovacuum_enabled = off);
insert into test select generate_series(1, 600000000); --- 20GB table
delete from test where a % 5 = 0;
vacuum (verbose, parallel 0) test;

Here are the results (total time and heap scan time):
PARALLEL 0: 21.99 s (single process)
PARALLEL 1: 11.39 s
PARALLEL 2: 8.36 s
PARALLEL 3: 6.14 s
PARALLEL 4: 5.08 s

* Test-2: parallel heap scan on the table with one index
I used a similar table to the test case 1 but created one btree index on it:
create unlogged table test (a int) with (autovacuum_enabled = off);
insert into test select generate_series(1, 600000000); --- 20GB table
create index on test (a);
delete from test where a % 5 = 0;
vacuum (verbose, parallel 0) test;

I've measured the total execution time as well as the time of each
vacuum phase (from left heap scan time, index vacuum time, and heap
vacuum time):

PARALLEL 0: 45.11 s (21.89, 16.74, 6.48)
PARALLEL 1: 42.13 s (12.75, 22.04, 7.23)
PARALLEL 2: 39.27 s (8.93, 22.78, 7.45)
PARALLEL 3: 36.53 s (6.76, 22.00, 7.65)
PARALLEL 4: 35.84 s (5.85, 22.04, 7.83)

Overall, I can see that the parallel heap scan in lazy vacuum has decent
scalability; in both test-1 and test-2, the execution time of heap
scan got ~4x faster with 4 parallel workers. On the other hand, when
it comes to the total vacuum execution time, I could not see much
performance improvement in test-2 (45.11 vs. 35.84). Looking at the
results PARALLEL 0 vs. PARALLEL 1 in test-2, the heap scan got faster
(21.89 vs. 12.75) whereas index vacuum got slower (16.74 vs. 22.04),
and heap scan in case 2 was not as fast as in case 1 with 1 parallel
worker (12.75 vs. 11.39).I think the reason is the shared TidStore is not very scalable since
we have a single lock on it. In all cases in test-1, we don't use
the shared TidStore since all dead tuples are removed during heap
pruning. So the scalability was better overall than in test-2. In
parallel 0 case in test-2, we use the local TidStore, and from
parallel degree of 1 in test-2, we use the shared TidStore and
parallel worker concurrently update it. Also, I guess that the lookup
performance of the local TidStore is better than the shared TidStore's
lookup performance because of the differences between a bump context
and a DSA area. I think that this difference contributed to the fact
that index vacuuming got slower (16.74 vs. 22.04).
Thank you for the comments!
There are two obvious ideas to improve the overall vacuum
execution time: (1) improve the shared TidStore scalability and (2)
support parallel heap vacuum. For (1), several ideas are proposed by
the ART authors[1]. I've not tried these ideas but it might be
applicable to our ART implementation. But I prefer to start with (2)
since it would be easier. Feedback is very welcome.

Starting with (2) sounds like a reasonable approach. We should study a
few more things like (a) the performance results where there are 3-4
indexes,
Here are the results with 4 indexes (and restarting the server before
the benchmark):
PARALLEL 0: 115.48 s (32.76, 64.46, 18.24)
PARALLEL 1: 74.88 s (17.11, 44.43, 13.25)
PARALLEL 2: 71.15 s (14.13, 44.82, 12.12)
PARALLEL 3: 46.78 s (10.74, 24.50, 11.43)
PARALLEL 4: 46.42 s (8.95, 24.96, 12.39) (launched 4 workers for heap
scan and 3 workers for index vacuum)
(b) What is the reason for performance improvement seen with
only heap scans. We normally get benefits of parallelism because of
using multiple CPUs but parallelizing scans (I/O) shouldn't give much
benefit. Is it possible that you are seeing benefits because most of
the data is either in shared_buffers or in memory? We can probably try
vacuuming tables by restarting the nodes to ensure the data is not in
memory.
I think it depends on the storage performance. FYI I use an EC2
instance (m6id.metal).
I've run the same benchmark script (table with no index) with
restarting the server before executing the vacuum, and here are the
results:
PARALLEL 0: 32.75 s
PARALLEL 1: 17.46 s
PARALLEL 2: 13.41 s
PARALLEL 3: 10.31 s
PARALLEL 4: 8.48 s
With the above two tests, I used the updated patch that I just submitted [1].
Regards,
[1]: /messages/by-id/CAD21AoAWHHnCg9OvtoEJnnvCc-3isyOyAGn+2KYoSXEv=vXauw@mail.gmail.com
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Dear Sawada-san,
Thank you for the test!
I could reproduce this issue and it's a bug; it skipped even
non-all-visible pages. I've attached the new version patch.BTW since we compute the number of parallel workers for the heap scan
based on the table size, it's possible that we launch multiple workers
even if most blocks are all-visible. It seems to be better if we
calculate it based on (relpages - relallvisible).
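
For illustration, a sketch of that suggested calculation, adapted from the
v2 patch's heap_parallel_vacuum_compute_workers() (untested, and note that
relallvisible comes from pg_class, so it may be stale):

```
static int
heap_parallel_vacuum_compute_workers_sketch(Relation rel, int nrequested)
{
	int			parallel_workers = 0;
	int			heap_parallel_threshold;
	BlockNumber heap_pages;
	BlockNumber allvisible;

	if (nrequested > 0)
		return nrequested;

	/* Count only the pages the scan is actually likely to visit */
	heap_pages = RelationGetNumberOfBlocks(rel);
	allvisible = rel->rd_rel->relallvisible;	/* may be stale */
	heap_pages = (heap_pages > allvisible) ? heap_pages - allvisible : 0;

	/* Same log-scale ramp-up as in the v2 patch */
	heap_parallel_threshold = Max(min_parallel_table_scan_size, 1);
	while (heap_pages >= (BlockNumber) (heap_parallel_threshold * 3))
	{
		parallel_workers++;
		heap_parallel_threshold *= 3;
		if (heap_parallel_threshold > INT_MAX / 3)
			break;
	}

	return parallel_workers;
}
```

This keeps the existing worker ramp-up but bases it only on the pages the
scan is expected to visit.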
Thanks for updating the patch. I applied and confirmed all pages are scanned.
I used almost the same script (just changed max_parallel_maintenance_workers)
and got below result. I think the tendency was the same as yours.
```
parallel 0: 61114.369 ms
parallel 1: 34870.316 ms
parallel 2: 23598.115 ms
parallel 3: 17732.363 ms
parallel 4: 15203.271 ms
parallel 5: 13836.025 ms
```
I started to read your code, but it takes much time because I've never seen it before...
The below part contains initial comments.
1.
This patch cannot be built when debug mode is enabled. See [1].
IIUC, this was because NewRelminMxid was moved from struct LVRelState to PHVShared.
So you should update it like "vacrel->counters->NewRelminMxid".
2.
I compared parallel heap scan and found that it does not have compute_worker API.
Can you clarify the reason why there is an inconsistency?
(I feel it is intentional because the calculation logic seems to depend on the heap structure,
so should we add the API for table scan as well?)
[1]:
```
vacuumlazy.c: In function ‘lazy_scan_prune’:
vacuumlazy.c:1666:34: error: ‘LVRelState’ {aka ‘struct LVRelState’} has no member named ‘NewRelminMxid’
Assert(MultiXactIdIsValid(vacrel->NewRelminMxid));
^~
....
```
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Thu, Jul 25, 2024 at 2:58 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Dear Sawada-san,
Thank you for the test!
I could reproduce this issue and it's a bug; it skipped even
non-all-visible pages. I've attached the new version patch.

BTW since we compute the number of parallel workers for the heap scan
based on the table size, it's possible that we launch multiple workers
even if most blocks are all-visible. It seems to be better if we
calculate it based on (relpages - relallvisible).

Thanks for updating the patch. I applied and confirmed all pages are scanned.
I used almost the same script (just changed max_parallel_maintenance_workers)
and got below result. I think the tendency was the same as yours.

```
parallel 0: 61114.369 ms
parallel 1: 34870.316 ms
parallel 2: 23598.115 ms
parallel 3: 17732.363 ms
parallel 4: 15203.271 ms
parallel 5: 13836.025 ms
```
Thank you for testing!
I started to read your code, but it takes much time because I've never seen it before...
The below part contains initial comments.

1.
This patch cannot be built when debug mode is enabled. See [1].
IIUC, this was because NewRelminMxid was moved from struct LVRelState to PHVShared.
So you should update it like "vacrel->counters->NewRelminMxid".
Right, will fix.
2.
I compared parallel heap scan and found that it does not have compute_worker API.
Can you clarify the reason why there is an inconsistency?
(I feel it is intentional because the calculation logic seems to depend on the heap structure,
so should we add the API for table scan as well?)
There is room to consider a better API design, but yes, the reason is
that the calculation logic depends on table AM implementation. For
example, I thought it might make sense to consider taking the number
of all-visible pages into account for the calculation of the number of
parallel workers as we don't want to launch many workers on the table
where most pages are all-visible. Which might not work for other table
AMs.
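
(For reference, the new callbacks would presumably be wired into each AM's
TableAmRoutine; for heap that would look roughly like the sketch below. The
heapam_handler.c hunk is not included in the patch excerpt above, so this
wiring is an assumption based on the declared functions.)

```
static const TableAmRoutine heapam_methods = {
	/* ... existing callbacks elided ... */
	.relation_vacuum = heap_vacuum_rel,

	/* parallel table vacuum callbacks added by the patch */
	.parallel_vacuum_compute_workers = heap_parallel_vacuum_compute_workers,
	.parallel_vacuum_estimate = heap_parallel_vacuum_estimate,
	.parallel_vacuum_initialize = heap_parallel_vacuum_initialize,
	.parallel_vacuum_scan_worker = heap_parallel_vacuum_scan_worker,
	/* ... */
};
```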
I'm updating the patch to implement parallel heap vacuum and will
share the updated patch. It might take time as it requires implementing
shared iteration support in the radix tree.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Dear Sawada-san,
Thank you for testing!
I tried to profile the vacuuming with the larger case (40 workers for the 20G table),
and the attached FlameGraph shows the result. IIUC, I cannot find any bottlenecks.
2.
I compared parallel heap scan and found that it does not have compute_worker API.
Can you clarify the reason why there is an inconsistency?
(I feel it is intentional because the calculation logic seems to depend on the heap structure,
so should we add the API for table scan as well?)
There is room to consider a better API design, but yes, the reason is
that the calculation logic depends on table AM implementation. For
example, I thought it might make sense to consider taking the number
of all-visible pages into account for the calculation of the number of
parallel workers as we don't want to launch many workers on the table
where most pages are all-visible. Which might not work for other table
AMs.
Okay, thanks for confirming. I wanted to ask others as well.
I'm updating the patch to implement parallel heap vacuum and will
share the updated patch. It might take time as it requires implementing
shared iteration support in the radix tree.
Here are other preliminary comments for the v2 patch. These do not include
cosmetic ones.
1.
The shared data structure PHVShared does not contain a mutex lock. Is this intentional
because its fields are accessed by the leader only after the parallel workers exit?
2.
Per my understanding, the vacuuming goes through the steps below.
a. parallel workers are launched for scanning pages
b. leader waits until scans are done
c. leader does vacuum alone (you may extend here...)
d. parallel workers are launched again to clean up indexes
If so, can we reuse parallel workers for the cleanup? Or is this more
painful engineering than the benefit is worth?
3.
According to LaunchParallelWorkers(), the bgw_name and bgw_type are hardcoded as
"parallel worker ..." Can we extend this to improve the trackability in the
pg_stat_activity?
4.
I'm not an expert on TidStore, but as you said, TidStoreLockExclusive() might be a
bottleneck when a TID is added to the shared TidStore. Another primitive idea
is to prepare per-worker TidStores (in the PHVScanWorkerState or LVRelCounters?)
and gather them after the heap scan, as in the sketch below. If you extend the
patch so that parallel workers also do the vacuuming, the gathering may not be
needed: each worker can access its own TidStore and clean up. One downside is
that the memory consumption may be quite large.
What do you think?
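
A minimal sketch of the gathering step, assuming the TidStore iteration
interface exposes per-block offsets roughly as below (the exact field and
function names of the iterator are an assumption):

```
/*
 * Hypothetical leader-side merge of one worker's local TidStore into the
 * shared one.  TidStoreSetBlockOffsets() is the same call the patch already
 * uses in dead_items_add(); the TidStoreIterResult layout is assumed.
 */
static void
merge_worker_dead_items(TidStore *shared_ts, TidStore *worker_ts)
{
	TidStoreIter *iter = TidStoreBeginIterate(worker_ts);
	TidStoreIterResult *result;

	while ((result = TidStoreIterateNext(iter)) != NULL)
	{
		/* The leader is the only writer, so no TidStoreLockExclusive() */
		TidStoreSetBlockOffsets(shared_ts, result->blkno,
								result->offsets, result->num_offsets);
	}

	TidStoreEndIterate(iter);
}
```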
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachments:
40_flamegraph.svg (application/octet-stream)
rect.attributes.fill.value = "rgb(230,0,230)";
// remember matches
if (matches[x] == undefined) {
matches[x] = w;
} else {
if (w > matches[x]) {
// overwrite with parent
matches[x] = w;
}
}
searching = 1;
}
}
if (!searching)
return;
var params = get_params();
params.s = currentSearchTerm;
history.replaceState(null, null, parse_params(params));
searchbtn.classList.add("show");
searchbtn.firstChild.nodeValue = "Reset Search";
// calculate percent matched, excluding vertical overlap
var count = 0;
var lastx = -1;
var lastw = 0;
var keys = Array();
for (k in matches) {
if (matches.hasOwnProperty(k))
keys.push(k);
}
// sort the matched frames by their x location
// ascending, then width descending
keys.sort(function(a, b){
return a - b;
});
// Step through frames saving only the biggest bottom-up frames
// thanks to the sort order. This relies on the tree property
// where children are always smaller than their parents.
var fudge = 0.0001; // JavaScript floating point
for (var k in keys) {
var x = parseFloat(keys[k]);
var w = matches[keys[k]];
if (x >= lastx + lastw - fudge) {
count += w;
lastx = x;
lastw = w;
}
}
// display matched percent
matchedtxt.classList.remove("hide");
var pct = 100 * count / maxwidth;
if (pct != 100) pct = pct.toFixed(1)
matchedtxt.firstChild.nodeValue = "Matched: " + pct + "%";
}
]]>
</script>
<rect x="0.0" y="0" width="1200.0" height="854.0" fill="url(#background)" />
<text id="title" x="600.00" y="24" >Flame Graph</text>
<text id="details" x="10.00" y="837" > </text>
<text id="unzoom" x="10.00" y="24" class="hide">Reset Zoom</text>
<text id="search" x="1090.00" y="24" >Search</text>
<text id="ignorecase" x="1174.00" y="24" >ic</text>
<text id="matched" x="1090.00" y="837" > </text>
<g id="frames">
<g >
<title>__rmqueue (322,728,072 samples, 0.04%)</title><rect x="126.4" y="101" width="0.5" height="15.0" fill="rgb(249,203,48)" rx="2" ry="2" />
<text x="129.40" y="111.5" ></text>
</g>
<g >
<title>enqueue_entity (295,591,692 samples, 0.04%)</title><rect x="1180.2" y="661" width="0.4" height="15.0" fill="rgb(218,62,15)" rx="2" ry="2" />
<text x="1183.22" y="671.5" ></text>
</g>
<g >
<title>cpu_startup_entry (18,163,022,403 samples, 2.20%)</title><rect x="1162.7" y="741" width="26.0" height="15.0" fill="rgb(252,220,52)" rx="2" ry="2" />
<text x="1165.66" y="751.5" >c..</text>
</g>
<g >
<title>scheduler_tick (118,516,889 samples, 0.01%)</title><rect x="709.3" y="389" width="0.1" height="15.0" fill="rgb(246,190,45)" rx="2" ry="2" />
<text x="712.27" y="399.5" ></text>
</g>
<g >
<title>pg_atomic_read_u32_impl (773,495,997 samples, 0.09%)</title><rect x="319.2" y="405" width="1.1" height="15.0" fill="rgb(231,122,29)" rx="2" ry="2" />
<text x="322.24" y="415.5" ></text>
</g>
<g >
<title>tas (614,778,162 samples, 0.07%)</title><rect x="75.0" y="453" width="0.9" height="15.0" fill="rgb(244,182,43)" rx="2" ry="2" />
<text x="78.03" y="463.5" ></text>
</g>
<g >
<title>pgstat_count_io_op (72,274,524 samples, 0.01%)</title><rect x="621.1" y="421" width="0.2" height="15.0" fill="rgb(207,10,2)" rx="2" ry="2" />
<text x="624.15" y="431.5" ></text>
</g>
<g >
<title>heap_tuple_should_freeze (95,669,608 samples, 0.01%)</title><rect x="54.1" y="341" width="0.2" height="15.0" fill="rgb(247,194,46)" rx="2" ry="2" />
<text x="57.11" y="351.5" ></text>
</g>
<g >
<title>__pte_alloc (1,508,822,768 samples, 0.18%)</title><rect x="386.4" y="245" width="2.1" height="15.0" fill="rgb(218,62,15)" rx="2" ry="2" />
<text x="389.38" y="255.5" ></text>
</g>
<g >
<title>list_del (631,365,829 samples, 0.08%)</title><rect x="573.4" y="101" width="0.9" height="15.0" fill="rgb(235,140,33)" rx="2" ry="2" />
<text x="576.42" y="111.5" ></text>
</g>
<g >
<title>dequeue_entity (423,622,158 samples, 0.05%)</title><rect x="593.1" y="213" width="0.6" height="15.0" fill="rgb(233,130,31)" rx="2" ry="2" />
<text x="596.10" y="223.5" ></text>
</g>
<g >
<title>heap_prune_record_dead_or_unused (2,386,563,437 samples, 0.29%)</title><rect x="935.6" y="501" width="3.4" height="15.0" fill="rgb(226,96,23)" rx="2" ry="2" />
<text x="938.56" y="511.5" ></text>
</g>
<g >
<title>lazy_scan_heap (22,604,331,570 samples, 2.74%)</title><rect x="11.0" y="693" width="32.3" height="15.0" fill="rgb(248,198,47)" rx="2" ry="2" />
<text x="13.98" y="703.5" >la..</text>
</g>
<g >
<title>ReleaseBuffer (155,476,154 samples, 0.02%)</title><rect x="133.0" y="517" width="0.2" height="15.0" fill="rgb(220,71,17)" rx="2" ry="2" />
<text x="136.00" y="527.5" ></text>
</g>
<g >
<title>LWLockAcquire (101,433,190 samples, 0.01%)</title><rect x="232.3" y="485" width="0.2" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="235.33" y="495.5" ></text>
</g>
<g >
<title>int_sqrt (184,053,246 samples, 0.02%)</title><rect x="1177.9" y="677" width="0.2" height="15.0" fill="rgb(253,220,52)" rx="2" ry="2" />
<text x="1180.89" y="687.5" ></text>
</g>
<g >
<title>x86_64_start_kernel (379,723,816 samples, 0.05%)</title><rect x="1189.5" y="757" width="0.5" height="15.0" fill="rgb(206,7,1)" rx="2" ry="2" />
<text x="1192.46" y="767.5" ></text>
</g>
<g >
<title>pte_alloc_one (1,494,467,745 samples, 0.18%)</title><rect x="386.4" y="229" width="2.1" height="15.0" fill="rgb(252,217,51)" rx="2" ry="2" />
<text x="389.40" y="239.5" ></text>
</g>
<g >
<title>wb_writeback (115,471,936 samples, 0.01%)</title><rect x="10.1" y="693" width="0.2" height="15.0" fill="rgb(222,80,19)" rx="2" ry="2" />
<text x="13.10" y="703.5" ></text>
</g>
<g >
<title>GetPrivateRefCount (123,131,901 samples, 0.01%)</title><rect x="257.7" y="421" width="0.2" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="260.69" y="431.5" ></text>
</g>
<g >
<title>__radix_tree_insert (4,428,321,008 samples, 0.54%)</title><rect x="456.2" y="165" width="6.4" height="15.0" fill="rgb(235,140,33)" rx="2" ry="2" />
<text x="459.24" y="175.5" ></text>
</g>
<g >
<title>do_softirq (535,134,408 samples, 0.06%)</title><rect x="1164.6" y="645" width="0.7" height="15.0" fill="rgb(223,87,20)" rx="2" ry="2" />
<text x="1167.57" y="655.5" ></text>
</g>
<g >
<title>do_generic_file_read.constprop.52 (169,978,536,003 samples, 20.62%)</title><rect x="343.5" y="325" width="243.4" height="15.0" fill="rgb(205,4,1)" rx="2" ry="2" />
<text x="346.51" y="335.5" >do_generic_file_read.constprop.52</text>
</g>
<g >
<title>BufferDescriptorGetContentLock (145,479,215 samples, 0.02%)</title><rect x="1089.3" y="469" width="0.2" height="15.0" fill="rgb(238,152,36)" rx="2" ry="2" />
<text x="1092.31" y="479.5" ></text>
</g>
<g >
<title>ItemPointerSet (2,156,126,831 samples, 0.26%)</title><rect x="627.1" y="533" width="3.1" height="15.0" fill="rgb(237,147,35)" rx="2" ry="2" />
<text x="630.10" y="543.5" ></text>
</g>
<g >
<title>TransactionIdPrecedes (2,515,838,353 samples, 0.31%)</title><rect x="705.6" y="517" width="3.6" height="15.0" fill="rgb(226,98,23)" rx="2" ry="2" />
<text x="708.59" y="527.5" ></text>
</g>
<g >
<title>ss_report_location (266,921,260 samples, 0.03%)</title><rect x="607.5" y="517" width="0.4" height="15.0" fill="rgb(249,202,48)" rx="2" ry="2" />
<text x="610.51" y="527.5" ></text>
</g>
<g >
<title>xfs_file_aio_write (701,647,072 samples, 0.09%)</title><rect x="50.5" y="149" width="1.0" height="15.0" fill="rgb(251,211,50)" rx="2" ry="2" />
<text x="53.47" y="159.5" ></text>
</g>
<g >
<title>LockBufHdr (166,567,596 samples, 0.02%)</title><rect x="1118.6" y="453" width="0.2" height="15.0" fill="rgb(236,143,34)" rx="2" ry="2" />
<text x="1121.57" y="463.5" ></text>
</g>
<g >
<title>WaitReadBuffers (40,272,838,991 samples, 4.89%)</title><rect x="74.7" y="501" width="57.6" height="15.0" fill="rgb(210,26,6)" rx="2" ry="2" />
<text x="77.68" y="511.5" >WaitRe..</text>
</g>
<g >
<title>__do_fault.isra.61 (79,157,688 samples, 0.01%)</title><rect x="75.5" y="357" width="0.1" height="15.0" fill="rgb(227,102,24)" rx="2" ry="2" />
<text x="78.50" y="367.5" ></text>
</g>
<g >
<title>__hrtimer_run_queues (105,887,461 samples, 0.01%)</title><rect x="203.0" y="389" width="0.2" height="15.0" fill="rgb(237,150,35)" rx="2" ry="2" />
<text x="206.01" y="399.5" ></text>
</g>
<g >
<title>security_file_permission (1,027,140,287 samples, 0.12%)</title><rect x="597.9" y="373" width="1.5" height="15.0" fill="rgb(225,96,23)" rx="2" ry="2" />
<text x="600.88" y="383.5" ></text>
</g>
<g >
<title>system_call_fastpath (707,254,508 samples, 0.09%)</title><rect x="50.5" y="213" width="1.0" height="15.0" fill="rgb(252,217,52)" rx="2" ry="2" />
<text x="53.47" y="223.5" ></text>
</g>
<g >
<title>shmem_add_to_page_cache.isra.26 (81,789,067,321 samples, 9.92%)</title><rect x="452.5" y="181" width="117.1" height="15.0" fill="rgb(250,207,49)" rx="2" ry="2" />
<text x="455.52" y="191.5" >shmem_add_to_p..</text>
</g>
<g >
<title>apic_timer_interrupt (137,726,205 samples, 0.02%)</title><rect x="853.8" y="469" width="0.2" height="15.0" fill="rgb(205,1,0)" rx="2" ry="2" />
<text x="856.81" y="479.5" ></text>
</g>
<g >
<title>compactify_tuples (105,072,182 samples, 0.01%)</title><rect x="854.3" y="501" width="0.2" height="15.0" fill="rgb(209,21,5)" rx="2" ry="2" />
<text x="857.34" y="511.5" ></text>
</g>
<g >
<title>schedule (596,433,624 samples, 0.07%)</title><rect x="16.3" y="277" width="0.9" height="15.0" fill="rgb(254,229,54)" rx="2" ry="2" />
<text x="19.34" y="287.5" ></text>
</g>
<g >
<title>lapic_next_deadline (194,545,753 samples, 0.02%)</title><rect x="1185.4" y="597" width="0.3" height="15.0" fill="rgb(222,82,19)" rx="2" ry="2" />
<text x="1188.38" y="607.5" ></text>
</g>
<g >
<title>_raw_qspin_lock_irq (113,817,551 samples, 0.01%)</title><rect x="130.2" y="261" width="0.2" height="15.0" fill="rgb(251,214,51)" rx="2" ry="2" />
<text x="133.19" y="271.5" ></text>
</g>
<g >
<title>PinBufferForBlock (162,226,846 samples, 0.02%)</title><rect x="22.8" y="533" width="0.2" height="15.0" fill="rgb(241,168,40)" rx="2" ry="2" />
<text x="25.79" y="543.5" ></text>
</g>
<g >
<title>apic_timer_interrupt (2,334,526,026 samples, 0.28%)</title><rect x="1163.7" y="693" width="3.4" height="15.0" fill="rgb(205,1,0)" rx="2" ry="2" />
<text x="1166.73" y="703.5" ></text>
</g>
<g >
<title>xfs_vm_writepages (115,471,936 samples, 0.01%)</title><rect x="10.1" y="613" width="0.2" height="15.0" fill="rgb(246,191,45)" rx="2" ry="2" />
<text x="13.10" y="623.5" ></text>
</g>
<g >
<title>LWLockWakeup (72,140,377 samples, 0.01%)</title><rect x="321.5" y="453" width="0.1" height="15.0" fill="rgb(210,24,5)" rx="2" ry="2" />
<text x="324.53" y="463.5" ></text>
</g>
<g >
<title>TransactionIdFollows (1,453,985,001 samples, 0.18%)</title><rect x="930.7" y="501" width="2.1" height="15.0" fill="rgb(222,79,18)" rx="2" ry="2" />
<text x="933.74" y="511.5" ></text>
</g>
<g >
<title>PageGetMaxOffsetNumber (135,737,711 samples, 0.02%)</title><rect x="864.6" y="485" width="0.2" height="15.0" fill="rgb(234,133,32)" rx="2" ry="2" />
<text x="867.59" y="495.5" ></text>
</g>
<g >
<title>shmem_recalc_inode (1,343,862,804 samples, 0.16%)</title><rect x="576.0" y="181" width="1.9" height="15.0" fill="rgb(214,42,10)" rx="2" ry="2" />
<text x="578.99" y="191.5" ></text>
</g>
<g >
<title>InitializeOneGUCOption (71,976,486 samples, 0.01%)</title><rect x="55.0" y="597" width="0.1" height="15.0" fill="rgb(229,111,26)" rx="2" ry="2" />
<text x="58.01" y="607.5" ></text>
</g>
<g >
<title>LWLockRelease (967,891,315 samples, 0.12%)</title><rect x="1141.4" y="501" width="1.4" height="15.0" fill="rgb(217,58,13)" rx="2" ry="2" />
<text x="1144.44" y="511.5" ></text>
</g>
<g >
<title>LWLockAttemptLock (104,243,313 samples, 0.01%)</title><rect x="23.5" y="469" width="0.2" height="15.0" fill="rgb(235,138,33)" rx="2" ry="2" />
<text x="26.54" y="479.5" ></text>
</g>
<g >
<title>LWLockWakeup (528,609,363 samples, 0.06%)</title><rect x="1141.8" y="485" width="0.7" height="15.0" fill="rgb(210,24,5)" rx="2" ry="2" />
<text x="1144.77" y="495.5" ></text>
</g>
<g >
<title>ExecVacuum (3,088,212,841 samples, 0.37%)</title><rect x="50.2" y="549" width="4.4" height="15.0" fill="rgb(213,37,9)" rx="2" ry="2" />
<text x="53.20" y="559.5" ></text>
</g>
<g >
<title>ConditionalLockBufferForCleanup (226,808,609 samples, 0.03%)</title><rect x="55.4" y="533" width="0.3" height="15.0" fill="rgb(216,53,12)" rx="2" ry="2" />
<text x="58.41" y="543.5" ></text>
</g>
<g >
<title>copy_user_enhanced_fast_string (575,488,434 samples, 0.07%)</title><rect x="21.4" y="405" width="0.8" height="15.0" fill="rgb(238,155,37)" rx="2" ry="2" />
<text x="24.41" y="415.5" ></text>
</g>
<g >
<title>HeapTupleSatisfiesVacuumHorizon (1,168,021,112 samples, 0.14%)</title><rect x="144.3" y="485" width="1.6" height="15.0" fill="rgb(207,13,3)" rx="2" ry="2" />
<text x="147.26" y="495.5" ></text>
</g>
<g >
<title>call_softirq (203,950,887 samples, 0.02%)</title><rect x="1189.6" y="565" width="0.3" height="15.0" fill="rgb(225,94,22)" rx="2" ry="2" />
<text x="1192.63" y="575.5" ></text>
</g>
<g >
<title>do_softirq (203,950,887 samples, 0.02%)</title><rect x="1189.6" y="581" width="0.3" height="15.0" fill="rgb(223,87,20)" rx="2" ry="2" />
<text x="1192.63" y="591.5" ></text>
</g>
<g >
<title>hrtimer_start_range_ns (801,244,050 samples, 0.10%)</title><rect x="1185.0" y="661" width="1.2" height="15.0" fill="rgb(244,179,42)" rx="2" ry="2" />
<text x="1188.03" y="671.5" ></text>
</g>
<g >
<title>do_nanosleep (82,563,287 samples, 0.01%)</title><rect x="71.8" y="309" width="0.2" height="15.0" fill="rgb(253,220,52)" rx="2" ry="2" />
<text x="74.85" y="319.5" ></text>
</g>
<g >
<title>_raw_qspin_lock_irq (417,446,285 samples, 0.05%)</title><rect x="591.7" y="277" width="0.6" height="15.0" fill="rgb(251,214,51)" rx="2" ry="2" />
<text x="594.69" y="287.5" ></text>
</g>
<g >
<title>LWLockAttemptLock (73,671,132 samples, 0.01%)</title><rect x="618.4" y="389" width="0.1" height="15.0" fill="rgb(235,138,33)" rx="2" ry="2" />
<text x="621.39" y="399.5" ></text>
</g>
<g >
<title>radix_tree_lookup_slot (486,036,641 samples, 0.06%)</title><rect x="256.3" y="261" width="0.7" height="15.0" fill="rgb(210,23,5)" rx="2" ry="2" />
<text x="259.29" y="271.5" ></text>
</g>
<g >
<title>FileReadV (186,288,969,029 samples, 22.60%)</title><rect x="333.8" y="469" width="266.7" height="15.0" fill="rgb(222,81,19)" rx="2" ry="2" />
<text x="336.81" y="479.5" >FileReadV</text>
</g>
<g >
<title>__audit_syscall_entry (70,480,025 samples, 0.01%)</title><rect x="309.9" y="357" width="0.1" height="15.0" fill="rgb(243,176,42)" rx="2" ry="2" />
<text x="312.87" y="367.5" ></text>
</g>
<g >
<title>TransactionIdPrecedes (823,765,745 samples, 0.10%)</title><rect x="203.9" y="469" width="1.1" height="15.0" fill="rgb(226,98,23)" rx="2" ry="2" />
<text x="206.86" y="479.5" ></text>
</g>
<g >
<title>__libc_start_main (128,045,675,684 samples, 15.54%)</title><rect x="50.2" y="773" width="183.3" height="15.0" fill="rgb(236,142,34)" rx="2" ry="2" />
<text x="53.19" y="783.5" >__libc_start_main</text>
</g>
<g >
<title>hash_search_with_hash_value (318,195,656 samples, 0.04%)</title><rect x="615.9" y="373" width="0.5" height="15.0" fill="rgb(249,205,49)" rx="2" ry="2" />
<text x="618.92" y="383.5" ></text>
</g>
<g >
<title>find_next_bit (73,626,817 samples, 0.01%)</title><rect x="311.2" y="213" width="0.1" height="15.0" fill="rgb(244,179,43)" rx="2" ry="2" />
<text x="314.21" y="223.5" ></text>
</g>
<g >
<title>futex_wake (280,422,222 samples, 0.03%)</title><rect x="1142.1" y="405" width="0.4" height="15.0" fill="rgb(219,65,15)" rx="2" ry="2" />
<text x="1145.10" y="415.5" ></text>
</g>
<g >
<title>ConditionVariableBroadcast (2,698,740,417 samples, 0.33%)</title><rect x="325.6" y="485" width="3.9" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="328.62" y="495.5" ></text>
</g>
<g >
<title>smp_apic_timer_interrupt (369,814,789 samples, 0.04%)</title><rect x="784.3" y="501" width="0.5" height="15.0" fill="rgb(221,74,17)" rx="2" ry="2" />
<text x="787.30" y="511.5" ></text>
</g>
<g >
<title>pg_atomic_read_u32_impl (135,558,230 samples, 0.02%)</title><rect x="132.6" y="437" width="0.2" height="15.0" fill="rgb(231,122,29)" rx="2" ry="2" />
<text x="135.57" y="447.5" ></text>
</g>
<g >
<title>update_process_times (105,887,461 samples, 0.01%)</title><rect x="203.0" y="341" width="0.2" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="206.01" y="351.5" ></text>
</g>
<g >
<title>StartBufferIO (492,043,252 samples, 0.06%)</title><rect x="330.9" y="485" width="0.7" height="15.0" fill="rgb(244,183,43)" rx="2" ry="2" />
<text x="333.90" y="495.5" ></text>
</g>
<g >
<title>xfs_vn_update_time (100,541,237 samples, 0.01%)</title><rect x="50.9" y="69" width="0.1" height="15.0" fill="rgb(234,136,32)" rx="2" ry="2" />
<text x="53.88" y="79.5" ></text>
</g>
<g >
<title>apic_timer_interrupt (159,855,016 samples, 0.02%)</title><rect x="349.7" y="309" width="0.2" height="15.0" fill="rgb(205,1,0)" rx="2" ry="2" />
<text x="352.69" y="319.5" ></text>
</g>
<g >
<title>heap_page_prune_and_freeze (1,488,034,874 samples, 0.18%)</title><rect x="52.4" y="405" width="2.2" height="15.0" fill="rgb(213,40,9)" rx="2" ry="2" />
<text x="55.43" y="415.5" ></text>
</g>
<g >
<title>xfs_trans_commit (112,688,247 samples, 0.01%)</title><rect x="586.7" y="261" width="0.1" height="15.0" fill="rgb(250,210,50)" rx="2" ry="2" />
<text x="589.66" y="271.5" ></text>
</g>
<g >
<title>maybe_start_bgworkers (637,265,816,681 samples, 77.32%)</title><rect x="233.6" y="709" width="912.4" height="15.0" fill="rgb(240,161,38)" rx="2" ry="2" />
<text x="236.63" y="719.5" >maybe_start_bgworkers</text>
</g>
<g >
<title>GetPrivateRefCountEntry (89,716,643 samples, 0.01%)</title><rect x="1138.2" y="485" width="0.1" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="1141.19" y="495.5" ></text>
</g>
<g >
<title>sys_pread64 (37,575,822,762 samples, 4.56%)</title><rect x="78.4" y="405" width="53.8" height="15.0" fill="rgb(212,35,8)" rx="2" ry="2" />
<text x="81.38" y="415.5" >sys_p..</text>
</g>
<g >
<title>visibilitymap_pin (797,848,131 samples, 0.10%)</title><rect x="1144.3" y="549" width="1.1" height="15.0" fill="rgb(253,221,53)" rx="2" ry="2" />
<text x="1147.26" y="559.5" ></text>
</g>
<g >
<title>heap_prepare_freeze_tuple (1,685,453,234 samples, 0.20%)</title><rect x="37.4" y="581" width="2.4" height="15.0" fill="rgb(227,101,24)" rx="2" ry="2" />
<text x="40.38" y="591.5" ></text>
</g>
<g >
<title>table_block_parallelscan_nextpage (1,394,665,687 samples, 0.17%)</title><rect x="607.9" y="517" width="2.0" height="15.0" fill="rgb(251,212,50)" rx="2" ry="2" />
<text x="610.89" y="527.5" ></text>
</g>
<g >
<title>BufferGetPage (291,133,286 samples, 0.04%)</title><rect x="1089.5" y="469" width="0.4" height="15.0" fill="rgb(253,220,52)" rx="2" ry="2" />
<text x="1092.52" y="479.5" ></text>
</g>
<g >
<title>PinBuffer (836,501,856 samples, 0.10%)</title><rect x="619.1" y="389" width="1.2" height="15.0" fill="rgb(219,64,15)" rx="2" ry="2" />
<text x="622.08" y="399.5" ></text>
</g>
<g >
<title>apic_timer_interrupt (131,841,886 samples, 0.02%)</title><rect x="682.4" y="501" width="0.1" height="15.0" fill="rgb(205,1,0)" rx="2" ry="2" />
<text x="685.35" y="511.5" ></text>
</g>
<g >
<title>sysret_check (295,261,045 samples, 0.04%)</title><rect x="77.8" y="421" width="0.5" height="15.0" fill="rgb(249,205,49)" rx="2" ry="2" />
<text x="80.83" y="431.5" ></text>
</g>
<g >
<title>TransactionIdPrecedes (147,600,825 samples, 0.02%)</title><rect x="38.5" y="565" width="0.2" height="15.0" fill="rgb(226,98,23)" rx="2" ry="2" />
<text x="41.49" y="575.5" ></text>
</g>
<g >
<title>system_call_fastpath (10,517,164,883 samples, 1.28%)</title><rect x="1146.0" y="773" width="15.0" height="15.0" fill="rgb(252,217,52)" rx="2" ry="2" />
<text x="1148.99" y="783.5" ></text>
</g>
<g >
<title>hrtimer_interrupt (70,484,646 samples, 0.01%)</title><rect x="1058.9" y="405" width="0.1" height="15.0" fill="rgb(228,109,26)" rx="2" ry="2" />
<text x="1061.89" y="415.5" ></text>
</g>
<g >
<title>get_user_pages_fast (128,367,624 samples, 0.02%)</title><rect x="49.0" y="645" width="0.2" height="15.0" fill="rgb(229,111,26)" rx="2" ry="2" />
<text x="52.02" y="655.5" ></text>
</g>
<g >
<title>kthread (115,471,936 samples, 0.01%)</title><rect x="10.1" y="757" width="0.2" height="15.0" fill="rgb(239,159,38)" rx="2" ry="2" />
<text x="13.10" y="767.5" ></text>
</g>
<g >
<title>heap_tuple_should_freeze (254,280,701 samples, 0.03%)</title><rect x="217.9" y="469" width="0.4" height="15.0" fill="rgb(247,194,46)" rx="2" ry="2" />
<text x="220.90" y="479.5" ></text>
</g>
<g >
<title>BlockIdSet (2,378,065,078 samples, 0.29%)</title><rect x="766.0" y="501" width="3.4" height="15.0" fill="rgb(236,143,34)" rx="2" ry="2" />
<text x="769.01" y="511.5" ></text>
</g>
<g >
<title>mem_cgroup_charge_common (1,009,313,280 samples, 0.12%)</title><rect x="97.1" y="149" width="1.5" height="15.0" fill="rgb(239,158,37)" rx="2" ry="2" />
<text x="100.12" y="159.5" ></text>
</g>
<g >
<title>xfs_iunlock (344,246,781 samples, 0.04%)</title><rect x="131.3" y="325" width="0.5" height="15.0" fill="rgb(232,127,30)" rx="2" ry="2" />
<text x="134.33" y="335.5" ></text>
</g>
<g >
<title>pgstat_tracks_io_op (234,846,491 samples, 0.03%)</title><rect x="332.4" y="469" width="0.3" height="15.0" fill="rgb(243,177,42)" rx="2" ry="2" />
<text x="335.39" y="479.5" ></text>
</g>
<g >
<title>mdreadv (157,402,898 samples, 0.02%)</title><rect x="51.7" y="357" width="0.2" height="15.0" fill="rgb(239,159,38)" rx="2" ry="2" />
<text x="54.67" y="367.5" ></text>
</g>
<g >
<title>touch_atime (547,498,164 samples, 0.07%)</title><rect x="586.1" y="309" width="0.8" height="15.0" fill="rgb(205,2,0)" rx="2" ry="2" />
<text x="589.08" y="319.5" ></text>
</g>
<g >
<title>iomap_apply (270,636,716 samples, 0.03%)</title><rect x="10.4" y="533" width="0.4" height="15.0" fill="rgb(247,194,46)" rx="2" ry="2" />
<text x="13.44" y="543.5" ></text>
</g>
<g >
<title>xfs_log_commit_cil (355,451,941 samples, 0.04%)</title><rect x="15.1" y="245" width="0.5" height="15.0" fill="rgb(207,11,2)" rx="2" ry="2" />
<text x="18.11" y="255.5" ></text>
</g>
<g >
<title>hrtimer_interrupt (86,182,567 samples, 0.01%)</title><rect x="96.9" y="117" width="0.1" height="15.0" fill="rgb(228,109,26)" rx="2" ry="2" />
<text x="99.88" y="127.5" ></text>
</g>
<g >
<title>xfs_trans_alloc (102,176,742 samples, 0.01%)</title><rect x="14.9" y="277" width="0.2" height="15.0" fill="rgb(214,45,10)" rx="2" ry="2" />
<text x="17.91" y="287.5" ></text>
</g>
<g >
<title>__inc_zone_page_state (243,537,605 samples, 0.03%)</title><rect x="455.8" y="165" width="0.3" height="15.0" fill="rgb(209,22,5)" rx="2" ry="2" />
<text x="458.78" y="175.5" ></text>
</g>
<g >
<title>mark_page_accessed (2,072,833,213 samples, 0.25%)</title><rect x="1151.6" y="629" width="2.9" height="15.0" fill="rgb(217,57,13)" rx="2" ry="2" />
<text x="1154.57" y="639.5" ></text>
</g>
<g >
<title>__find_get_page (3,924,892,558 samples, 0.48%)</title><rect x="344.0" y="309" width="5.6" height="15.0" fill="rgb(229,114,27)" rx="2" ry="2" />
<text x="347.03" y="319.5" ></text>
</g>
<g >
<title>BufferIsValid (980,288,190 samples, 0.12%)</title><rect x="1107.9" y="421" width="1.4" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="1110.92" y="431.5" ></text>
</g>
<g >
<title>do_group_exit (10,517,164,883 samples, 1.28%)</title><rect x="1146.0" y="741" width="15.0" height="15.0" fill="rgb(219,67,16)" rx="2" ry="2" />
<text x="1148.99" y="751.5" ></text>
</g>
<g >
<title>PinBufferForBlock (6,500,970,346 samples, 0.79%)</title><rect x="11.2" y="581" width="9.3" height="15.0" fill="rgb(241,168,40)" rx="2" ry="2" />
<text x="14.17" y="591.5" ></text>
</g>
<g >
<title>PageGetItemId (163,735,936 samples, 0.02%)</title><rect x="32.5" y="581" width="0.3" height="15.0" fill="rgb(246,192,46)" rx="2" ry="2" />
<text x="35.53" y="591.5" ></text>
</g>
<g >
<title>vm_readbuf (95,273,516 samples, 0.01%)</title><rect x="621.6" y="517" width="0.1" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="624.56" y="527.5" ></text>
</g>
<g >
<title>apic_timer_interrupt (206,275,440 samples, 0.03%)</title><rect x="709.2" y="517" width="0.3" height="15.0" fill="rgb(205,1,0)" rx="2" ry="2" />
<text x="712.19" y="527.5" ></text>
</g>
<g >
<title>do_page_fault (143,314,527,755 samples, 17.39%)</title><rect x="380.7" y="293" width="205.1" height="15.0" fill="rgb(216,54,13)" rx="2" ry="2" />
<text x="383.67" y="303.5" >do_page_fault</text>
</g>
<g >
<title>__audit_syscall_exit (735,968,673 samples, 0.09%)</title><rect x="335.9" y="421" width="1.1" height="15.0" fill="rgb(218,62,14)" rx="2" ry="2" />
<text x="338.90" y="431.5" ></text>
</g>
<g >
<title>ReleaseBuffer (424,736,470 samples, 0.05%)</title><rect x="606.9" y="517" width="0.6" height="15.0" fill="rgb(220,71,17)" rx="2" ry="2" />
<text x="609.89" y="527.5" ></text>
</g>
<g >
<title>__find_lock_page (89,010,712 samples, 0.01%)</title><rect x="13.9" y="261" width="0.1" height="15.0" fill="rgb(251,214,51)" rx="2" ry="2" />
<text x="16.87" y="271.5" ></text>
</g>
<g >
<title>select_task_rq_fair (170,990,994 samples, 0.02%)</title><rect x="1165.9" y="565" width="0.2" height="15.0" fill="rgb(211,29,7)" rx="2" ry="2" />
<text x="1168.86" y="575.5" ></text>
</g>
<g >
<title>BufferIsValid (1,080,172,541 samples, 0.13%)</title><rect x="1109.9" y="453" width="1.5" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="1112.86" y="463.5" ></text>
</g>
<g >
<title>__radix_tree_lookup (636,857,395 samples, 0.08%)</title><rect x="245.3" y="229" width="0.9" height="15.0" fill="rgb(253,222,53)" rx="2" ry="2" />
<text x="248.32" y="239.5" ></text>
</g>
<g >
<title>file_update_time (1,997,989,982 samples, 0.24%)</title><rect x="579.3" y="229" width="2.9" height="15.0" fill="rgb(210,27,6)" rx="2" ry="2" />
<text x="582.35" y="239.5" ></text>
</g>
<g >
<title>visibilitymap_get_status (285,548,717 samples, 0.03%)</title><rect x="23.3" y="613" width="0.4" height="15.0" fill="rgb(217,59,14)" rx="2" ry="2" />
<text x="26.34" y="623.5" ></text>
</g>
<g >
<title>call_rwsem_down_write_failed (843,114,110 samples, 0.10%)</title><rect x="16.0" y="309" width="1.2" height="15.0" fill="rgb(205,0,0)" rx="2" ry="2" />
<text x="18.99" y="319.5" ></text>
</g>
<g >
<title>LWLockConditionalAcquire (408,957,962 samples, 0.05%)</title><rect x="235.2" y="517" width="0.6" height="15.0" fill="rgb(245,185,44)" rx="2" ry="2" />
<text x="238.24" y="527.5" ></text>
</g>
<g >
<title>BufferIsValid (147,989,814 samples, 0.02%)</title><rect x="769.8" y="501" width="0.2" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="772.76" y="511.5" ></text>
</g>
<g >
<title>PageGetItem (283,395,127 samples, 0.03%)</title><rect x="29.1" y="613" width="0.4" height="15.0" fill="rgb(214,43,10)" rx="2" ry="2" />
<text x="32.12" y="623.5" ></text>
</g>
<g >
<title>perform_spin_delay (100,806,989 samples, 0.01%)</title><rect x="247.7" y="389" width="0.2" height="15.0" fill="rgb(247,196,46)" rx="2" ry="2" />
<text x="250.71" y="399.5" ></text>
</g>
<g >
<title>pg_preadv (188,855,651 samples, 0.02%)</title><rect x="600.0" y="453" width="0.3" height="15.0" fill="rgb(213,40,9)" rx="2" ry="2" />
<text x="602.98" y="463.5" ></text>
</g>
<g >
<title>queued_spin_lock_slowpath (200,576,965 samples, 0.02%)</title><rect x="1159.0" y="533" width="0.3" height="15.0" fill="rgb(231,122,29)" rx="2" ry="2" />
<text x="1162.04" y="543.5" ></text>
</g>
<g >
<title>rw_verify_area (194,418,688 samples, 0.02%)</title><rect x="131.9" y="373" width="0.2" height="15.0" fill="rgb(218,64,15)" rx="2" ry="2" />
<text x="134.86" y="383.5" ></text>
</g>
<g >
<title>__radix_tree_lookup (122,104,010 samples, 0.01%)</title><rect x="64.1" y="229" width="0.2" height="15.0" fill="rgb(253,222,53)" rx="2" ry="2" />
<text x="67.12" y="239.5" ></text>
</g>
<g >
<title>apic_timer_interrupt (127,019,827 samples, 0.02%)</title><rect x="44.0" y="725" width="0.1" height="15.0" fill="rgb(205,1,0)" rx="2" ry="2" />
<text x="46.96" y="735.5" ></text>
</g>
<g >
<title>StartReadBuffer (6,511,909,804 samples, 0.79%)</title><rect x="11.2" y="613" width="9.3" height="15.0" fill="rgb(222,78,18)" rx="2" ry="2" />
<text x="14.15" y="623.5" ></text>
</g>
<g >
<title>FlushBuffer (5,304,482,542 samples, 0.64%)</title><rect x="11.9" y="533" width="7.6" height="15.0" fill="rgb(254,226,54)" rx="2" ry="2" />
<text x="14.93" y="543.5" ></text>
</g>
<g >
<title>tick_sched_timer (75,893,920 samples, 0.01%)</title><rect x="173.3" y="373" width="0.2" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="176.34" y="383.5" ></text>
</g>
<g >
<title>apic_timer_interrupt (263,118,266 samples, 0.03%)</title><rect x="933.4" y="501" width="0.4" height="15.0" fill="rgb(205,1,0)" rx="2" ry="2" />
<text x="936.44" y="511.5" ></text>
</g>
<g >
<title>HeapTupleHeaderAdvanceConflictHorizon (6,228,913,769 samples, 0.76%)</title><rect x="907.3" y="501" width="8.9" height="15.0" fill="rgb(240,164,39)" rx="2" ry="2" />
<text x="910.26" y="511.5" ></text>
</g>
<g >
<title>BufferGetBlockNumber (856,293,468 samples, 0.10%)</title><rect x="666.3" y="517" width="1.2" height="15.0" fill="rgb(206,7,1)" rx="2" ry="2" />
<text x="669.32" y="527.5" ></text>
</g>
<g >
<title>hrtimer_cancel (115,533,294 samples, 0.01%)</title><rect x="1187.4" y="693" width="0.2" height="15.0" fill="rgb(254,228,54)" rx="2" ry="2" />
<text x="1190.43" y="703.5" ></text>
</g>
<g >
<title>compactify_tuples (3,494,990,954 samples, 0.42%)</title><rect x="174.1" y="469" width="5.0" height="15.0" fill="rgb(209,21,5)" rx="2" ry="2" />
<text x="177.11" y="479.5" ></text>
</g>
<g >
<title>fsm_readbuf (279,923,013 samples, 0.03%)</title><rect x="22.6" y="613" width="0.4" height="15.0" fill="rgb(234,136,32)" rx="2" ry="2" />
<text x="25.62" y="623.5" ></text>
</g>
<g >
<title>radix_tree_descend (1,276,223,732 samples, 0.15%)</title><rect x="458.6" y="133" width="1.8" height="15.0" fill="rgb(243,175,41)" rx="2" ry="2" />
<text x="461.62" y="143.5" ></text>
</g>
<g >
<title>heap_vacuum_rel (22,616,657,143 samples, 2.74%)</title><rect x="11.0" y="709" width="32.4" height="15.0" fill="rgb(231,119,28)" rx="2" ry="2" />
<text x="13.98" y="719.5" >he..</text>
</g>
<g >
<title>vfs_write (313,556,868 samples, 0.04%)</title><rect x="10.4" y="613" width="0.5" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="13.42" y="623.5" ></text>
</g>
<g >
<title>shmem_fault (595,808,066 samples, 0.07%)</title><rect x="327.8" y="357" width="0.8" height="15.0" fill="rgb(236,143,34)" rx="2" ry="2" />
<text x="330.77" y="367.5" ></text>
</g>
<g >
<title>HeapTupleSatisfiesVacuumHorizon (78,231,985 samples, 0.01%)</title><rect x="28.6" y="613" width="0.2" height="15.0" fill="rgb(207,13,3)" rx="2" ry="2" />
<text x="31.64" y="623.5" ></text>
</g>
<g >
<title>down_write (82,091,217 samples, 0.01%)</title><rect x="586.5" y="245" width="0.1" height="15.0" fill="rgb(222,79,18)" rx="2" ry="2" />
<text x="589.51" y="255.5" ></text>
</g>
<g >
<title>TransactionIdPrecedes (916,817,104 samples, 0.11%)</title><rect x="216.1" y="437" width="1.3" height="15.0" fill="rgb(226,98,23)" rx="2" ry="2" />
<text x="219.08" y="447.5" ></text>
</g>
<g >
<title>__do_page_fault (2,320,058,077 samples, 0.28%)</title><rect x="243.3" y="373" width="3.4" height="15.0" fill="rgb(239,158,37)" rx="2" ry="2" />
<text x="246.35" y="383.5" ></text>
</g>
<g >
<title>TidStoreMemoryUsage (2,275,949,099 samples, 0.28%)</title><rect x="601.4" y="549" width="3.2" height="15.0" fill="rgb(229,113,27)" rx="2" ry="2" />
<text x="604.37" y="559.5" ></text>
</g>
<g >
<title>InvalidateVictimBuffer (344,974,810 samples, 0.04%)</title><rect x="19.5" y="533" width="0.5" height="15.0" fill="rgb(248,198,47)" rx="2" ry="2" />
<text x="22.53" y="543.5" ></text>
</g>
<g >
<title>ItemPointerIsValid (1,580,431,853 samples, 0.19%)</title><rect x="1125.9" y="501" width="2.3" height="15.0" fill="rgb(206,7,1)" rx="2" ry="2" />
<text x="1128.95" y="511.5" ></text>
</g>
<g >
<title>[unknown] (858,525,970 samples, 0.10%)</title><rect x="46.9" y="757" width="1.3" height="15.0" fill="rgb(210,24,5)" rx="2" ry="2" />
<text x="49.93" y="767.5" ></text>
</g>
<g >
<title>__do_page_fault (30,497,477,926 samples, 3.70%)</title><rect x="85.4" y="261" width="43.7" height="15.0" fill="rgb(239,158,37)" rx="2" ry="2" />
<text x="88.42" y="271.5" >__do..</text>
</g>
<g >
<title>ReadBufferExtended (1,340,860,850 samples, 0.16%)</title><rect x="133.8" y="469" width="1.9" height="15.0" fill="rgb(242,171,40)" rx="2" ry="2" />
<text x="136.78" y="479.5" ></text>
</g>
<g >
<title>GetPrivateRefCount (185,930,761 samples, 0.02%)</title><rect x="1138.1" y="501" width="0.2" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="1141.06" y="511.5" ></text>
</g>
<g >
<title>smp_apic_timer_interrupt (77,288,274 samples, 0.01%)</title><rect x="621.3" y="453" width="0.1" height="15.0" fill="rgb(221,74,17)" rx="2" ry="2" />
<text x="624.34" y="463.5" ></text>
</g>
<g >
<title>__perf_event_task_sched_in (110,725,430 samples, 0.01%)</title><rect x="1181.9" y="677" width="0.2" height="15.0" fill="rgb(231,121,29)" rx="2" ry="2" />
<text x="1184.90" y="687.5" ></text>
</g>
<g >
<title>process_one_work (115,471,936 samples, 0.01%)</title><rect x="10.1" y="725" width="0.2" height="15.0" fill="rgb(237,151,36)" rx="2" ry="2" />
<text x="13.10" y="735.5" ></text>
</g>
<g >
<title>heap_tuple_should_freeze (761,463,094 samples, 0.09%)</title><rect x="38.7" y="565" width="1.1" height="15.0" fill="rgb(247,194,46)" rx="2" ry="2" />
<text x="41.70" y="575.5" ></text>
</g>
<g >
<title>__radix_tree_lookup (196,157,364 samples, 0.02%)</title><rect x="328.2" y="277" width="0.3" height="15.0" fill="rgb(253,222,53)" rx="2" ry="2" />
<text x="331.22" y="287.5" ></text>
</g>
<g >
<title>smp_apic_timer_interrupt (137,726,205 samples, 0.02%)</title><rect x="853.8" y="453" width="0.2" height="15.0" fill="rgb(221,74,17)" rx="2" ry="2" />
<text x="856.81" y="463.5" ></text>
</g>
<g >
<title>parallel_vacuum_process_table (637,236,586,138 samples, 77.31%)</title><rect x="233.7" y="613" width="912.3" height="15.0" fill="rgb(205,3,0)" rx="2" ry="2" />
<text x="236.67" y="623.5" >parallel_vacuum_process_table</text>
</g>
<g >
<title>BufTableLookup (205,061,762 samples, 0.02%)</title><rect x="11.6" y="549" width="0.3" height="15.0" fill="rgb(224,89,21)" rx="2" ry="2" />
<text x="14.60" y="559.5" ></text>
</g>
<g >
<title>BufferIsValid (149,567,202 samples, 0.02%)</title><rect x="759.8" y="517" width="0.2" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="762.81" y="527.5" ></text>
</g>
<g >
<title>system_call_after_swapgs (297,624,001 samples, 0.04%)</title><rect x="43.5" y="661" width="0.4" height="15.0" fill="rgb(243,179,42)" rx="2" ry="2" />
<text x="46.50" y="671.5" ></text>
</g>
<g >
<title>update_process_times (132,283,144 samples, 0.02%)</title><rect x="709.3" y="405" width="0.1" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="712.26" y="415.5" ></text>
</g>
<g >
<title>hash_initial_lookup (779,209,400 samples, 0.09%)</title><rect x="253.8" y="421" width="1.1" height="15.0" fill="rgb(251,214,51)" rx="2" ry="2" />
<text x="256.83" y="431.5" ></text>
</g>
<g >
<title>fsnotify (273,688,962 samples, 0.03%)</title><rect x="598.2" y="357" width="0.4" height="15.0" fill="rgb(215,50,12)" rx="2" ry="2" />
<text x="601.17" y="367.5" ></text>
</g>
<g >
<title>list_del (263,101,215 samples, 0.03%)</title><rect x="126.5" y="85" width="0.4" height="15.0" fill="rgb(235,140,33)" rx="2" ry="2" />
<text x="129.49" y="95.5" ></text>
</g>
<g >
<title>TransactionIdFollows (483,413,334 samples, 0.06%)</title><rect x="203.2" y="469" width="0.7" height="15.0" fill="rgb(222,79,18)" rx="2" ry="2" />
<text x="206.17" y="479.5" ></text>
</g>
<g >
<title>radix_tree_descend (120,946,008 samples, 0.01%)</title><rect x="246.2" y="229" width="0.2" height="15.0" fill="rgb(243,175,41)" rx="2" ry="2" />
<text x="249.24" y="239.5" ></text>
</g>
<g >
<title>update_process_times (95,227,023 samples, 0.01%)</title><rect x="393.6" y="85" width="0.2" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="396.64" y="95.5" ></text>
</g>
<g >
<title>radix_tree_lookup_slot (242,208,178 samples, 0.03%)</title><rect x="328.2" y="293" width="0.4" height="15.0" fill="rgb(210,23,5)" rx="2" ry="2" />
<text x="331.21" y="303.5" ></text>
</g>
<g >
<title>__mem_cgroup_try_charge (429,001,947 samples, 0.05%)</title><rect x="451.6" y="149" width="0.6" height="15.0" fill="rgb(237,147,35)" rx="2" ry="2" />
<text x="454.59" y="159.5" ></text>
</g>
<g >
<title>CheckBufferIsPinnedOnce (177,920,144 samples, 0.02%)</title><rect x="257.6" y="437" width="0.3" height="15.0" fill="rgb(244,183,43)" rx="2" ry="2" />
<text x="260.63" y="447.5" ></text>
</g>
<g >
<title>WaitReadBuffers (183,808,446 samples, 0.02%)</title><rect x="51.6" y="389" width="0.3" height="15.0" fill="rgb(210,26,6)" rx="2" ry="2" />
<text x="54.63" y="399.5" ></text>
</g>
<g >
<title>generic_file_aio_read (170,394,819,281 samples, 20.67%)</title><rect x="343.1" y="341" width="243.9" height="15.0" fill="rgb(216,53,12)" rx="2" ry="2" />
<text x="346.07" y="351.5" >generic_file_aio_read</text>
</g>
<g >
<title>smp_apic_timer_interrupt (131,841,886 samples, 0.02%)</title><rect x="682.4" y="485" width="0.1" height="15.0" fill="rgb(221,74,17)" rx="2" ry="2" />
<text x="685.35" y="495.5" ></text>
</g>
<g >
<title>[unknown] (4,331,051,912 samples, 0.53%)</title><rect x="43.9" y="773" width="6.2" height="15.0" fill="rgb(210,24,5)" rx="2" ry="2" />
<text x="46.93" y="783.5" ></text>
</g>
<g >
<title>ItemPointerSet (87,248,740 samples, 0.01%)</title><rect x="23.9" y="629" width="0.1" height="15.0" fill="rgb(237,147,35)" rx="2" ry="2" />
<text x="26.86" y="639.5" ></text>
</g>
<g >
<title>BufferIsValid (119,329,650 samples, 0.01%)</title><rect x="227.3" y="437" width="0.2" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="230.28" y="447.5" ></text>
</g>
<g >
<title>radix_tree_descend (367,628,238 samples, 0.04%)</title><rect x="348.8" y="261" width="0.6" height="15.0" fill="rgb(243,175,41)" rx="2" ry="2" />
<text x="351.83" y="271.5" ></text>
</g>
<g >
<title>tas (88,089,515 samples, 0.01%)</title><rect x="73.3" y="405" width="0.1" height="15.0" fill="rgb(244,182,43)" rx="2" ry="2" />
<text x="76.32" y="415.5" ></text>
</g>
<g >
<title>StartReadBuffer (398,349,091 samples, 0.05%)</title><rect x="43.4" y="773" width="0.5" height="15.0" fill="rgb(222,78,18)" rx="2" ry="2" />
<text x="46.36" y="783.5" ></text>
</g>
<g >
<title>idle_exit_fair (78,500,553 samples, 0.01%)</title><rect x="1182.5" y="661" width="0.1" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="1185.46" y="671.5" ></text>
</g>
<g >
<title>radix_tree_descend (162,364,900 samples, 0.02%)</title><rect x="60.4" y="213" width="0.2" height="15.0" fill="rgb(243,175,41)" rx="2" ry="2" />
<text x="63.36" y="223.5" ></text>
</g>
<g >
<title>heap_parallel_vacuum_scan_worker (637,236,586,138 samples, 77.31%)</title><rect x="233.7" y="581" width="912.3" height="15.0" fill="rgb(209,21,5)" rx="2" ry="2" />
<text x="236.67" y="591.5" >heap_parallel_vacuum_scan_worker</text>
</g>
<g >
<title>shmem_fault (270,519,733 samples, 0.03%)</title><rect x="63.9" y="309" width="0.4" height="15.0" fill="rgb(236,143,34)" rx="2" ry="2" />
<text x="66.93" y="319.5" ></text>
</g>
<g >
<title>lazy_scan_prune (67,794,568,622 samples, 8.23%)</title><rect x="135.8" y="533" width="97.1" height="15.0" fill="rgb(243,178,42)" rx="2" ry="2" />
<text x="138.82" y="543.5" >lazy_scan_p..</text>
</g>
<g >
<title>__radix_tree_lookup (448,623,926 samples, 0.05%)</title><rect x="256.3" y="245" width="0.6" height="15.0" fill="rgb(253,222,53)" rx="2" ry="2" />
<text x="259.29" y="255.5" ></text>
</g>
<g >
<title>pg_atomic_read_u32 (812,554,938 samples, 0.10%)</title><rect x="319.2" y="421" width="1.1" height="15.0" fill="rgb(248,202,48)" rx="2" ry="2" />
<text x="322.18" y="431.5" ></text>
</g>
<g >
<title>error_swapgs (82,319,755 samples, 0.01%)</title><rect x="49.6" y="709" width="0.1" height="15.0" fill="rgb(251,212,50)" rx="2" ry="2" />
<text x="52.58" y="719.5" ></text>
</g>
<g >
<title>native_queued_spin_lock_slowpath (2,701,879,035 samples, 0.33%)</title><rect x="440.9" y="149" width="3.8" height="15.0" fill="rgb(238,153,36)" rx="2" ry="2" />
<text x="443.88" y="159.5" ></text>
</g>
<g >
<title>HeapTupleSatisfiesVacuumHorizon (197,255,195 samples, 0.02%)</title><rect x="146.4" y="501" width="0.3" height="15.0" fill="rgb(207,13,3)" rx="2" ry="2" />
<text x="149.40" y="511.5" ></text>
</g>
<g >
<title>xfs_file_aio_read (178,159,674,754 samples, 21.62%)</title><rect x="342.1" y="373" width="255.0" height="15.0" fill="rgb(224,90,21)" rx="2" ry="2" />
<text x="345.07" y="383.5" >xfs_file_aio_read</text>
</g>
<g >
<title>ItemPointerIsValid (1,407,275,130 samples, 0.17%)</title><rect x="678.9" y="485" width="2.0" height="15.0" fill="rgb(206,7,1)" rx="2" ry="2" />
<text x="681.89" y="495.5" ></text>
</g>
<g >
<title>TransactionIdFollows (121,998,621 samples, 0.01%)</title><rect x="35.2" y="597" width="0.2" height="15.0" fill="rgb(222,79,18)" rx="2" ry="2" />
<text x="38.18" y="607.5" ></text>
</g>
<g >
<title>postmaster_child_launch (637,265,816,681 samples, 77.32%)</title><rect x="233.6" y="677" width="912.4" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="236.63" y="687.5" >postmaster_child_launch</text>
</g>
<g >
<title>xfs_ilock (963,834,659 samples, 0.12%)</title><rect x="15.8" y="341" width="1.4" height="15.0" fill="rgb(249,203,48)" rx="2" ry="2" />
<text x="18.82" y="351.5" ></text>
</g>
<g >
<title>unmap_vmas (10,374,169,931 samples, 1.26%)</title><rect x="1146.1" y="677" width="14.9" height="15.0" fill="rgb(243,176,42)" rx="2" ry="2" />
<text x="1149.15" y="687.5" ></text>
</g>
<g >
<title>PortalRun (3,088,212,841 samples, 0.37%)</title><rect x="50.2" y="629" width="4.4" height="15.0" fill="rgb(223,85,20)" rx="2" ry="2" />
<text x="53.20" y="639.5" ></text>
</g>
<g >
<title>intel_idle (6,010,443,106 samples, 0.73%)</title><rect x="1167.2" y="677" width="8.6" height="15.0" fill="rgb(237,147,35)" rx="2" ry="2" />
<text x="1170.23" y="687.5" ></text>
</g>
<g >
<title>account_entity_dequeue (90,354,745 samples, 0.01%)</title><rect x="593.2" y="197" width="0.1" height="15.0" fill="rgb(231,120,28)" rx="2" ry="2" />
<text x="596.16" y="207.5" ></text>
</g>
<g >
<title>tick_sched_timer (74,985,907 samples, 0.01%)</title><rect x="621.3" y="389" width="0.1" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="624.34" y="399.5" ></text>
</g>
<g >
<title>TransactionIdDidCommit (474,727,945 samples, 0.06%)</title><rect x="1128.6" y="501" width="0.7" height="15.0" fill="rgb(216,51,12)" rx="2" ry="2" />
<text x="1131.58" y="511.5" ></text>
</g>
<g >
<title>BufferGetPage (127,844,494 samples, 0.02%)</title><rect x="610.6" y="485" width="0.2" height="15.0" fill="rgb(253,220,52)" rx="2" ry="2" />
<text x="613.60" y="495.5" ></text>
</g>
<g >
<title>smgrwritev (5,080,836,104 samples, 0.62%)</title><rect x="12.3" y="501" width="7.2" height="15.0" fill="rgb(217,56,13)" rx="2" ry="2" />
<text x="15.25" y="511.5" ></text>
</g>
<g >
<title>BufferGetBlockNumber (1,369,405,226 samples, 0.17%)</title><rect x="1136.4" y="517" width="1.9" height="15.0" fill="rgb(206,7,1)" rx="2" ry="2" />
<text x="1139.39" y="527.5" ></text>
</g>
<g >
<title>vm_readbuf (7,777,835,010 samples, 0.94%)</title><rect x="610.4" y="501" width="11.2" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="613.43" y="511.5" ></text>
</g>
<g >
<title>ss_search (126,871,703 samples, 0.02%)</title><rect x="609.7" y="485" width="0.2" height="15.0" fill="rgb(244,181,43)" rx="2" ry="2" />
<text x="612.70" y="495.5" ></text>
</g>
<g >
<title>pg_atomic_read_u32_impl (782,945,425 samples, 0.09%)</title><rect x="617.2" y="341" width="1.1" height="15.0" fill="rgb(231,122,29)" rx="2" ry="2" />
<text x="620.21" y="351.5" ></text>
</g>
<g >
<title>pg_atomic_read_u32 (140,930,970 samples, 0.02%)</title><rect x="225.4" y="437" width="0.2" height="15.0" fill="rgb(248,202,48)" rx="2" ry="2" />
<text x="228.39" y="447.5" ></text>
</g>
<g >
<title>GetPrivateRefCountEntry (118,334,939 samples, 0.01%)</title><rect x="607.3" y="469" width="0.2" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="610.33" y="479.5" ></text>
</g>
<g >
<title>smgrreadv (187,449,915,647 samples, 22.74%)</title><rect x="332.8" y="501" width="268.3" height="15.0" fill="rgb(240,165,39)" rx="2" ry="2" />
<text x="335.78" y="511.5" >smgrreadv</text>
</g>
<g >
<title>ret_from_fork_nospec_end (115,471,936 samples, 0.01%)</title><rect x="10.1" y="773" width="0.2" height="15.0" fill="rgb(205,1,0)" rx="2" ry="2" />
<text x="13.10" y="783.5" ></text>
</g>
<g >
<title>__put_single_page (500,981,483 samples, 0.06%)</title><rect x="1164.6" y="533" width="0.7" height="15.0" fill="rgb(214,45,10)" rx="2" ry="2" />
<text x="1167.61" y="543.5" ></text>
</g>
<g >
<title>heap_prune_record_unchanged_lp_normal (3,066,171,706 samples, 0.37%)</title><rect x="35.5" y="597" width="4.4" height="15.0" fill="rgb(221,76,18)" rx="2" ry="2" />
<text x="38.55" y="607.5" ></text>
</g>
<g >
<title>get_hash_entry (206,305,776 samples, 0.03%)</title><rect x="331.6" y="501" width="0.3" height="15.0" fill="rgb(225,93,22)" rx="2" ry="2" />
<text x="334.62" y="511.5" ></text>
</g>
<g >
<title>smp_apic_timer_interrupt (119,060,468 samples, 0.01%)</title><rect x="1058.9" y="437" width="0.1" height="15.0" fill="rgb(221,74,17)" rx="2" ry="2" />
<text x="1061.88" y="447.5" ></text>
</g>
<g >
<title>auditsys (394,905,399 samples, 0.05%)</title><rect x="335.1" y="437" width="0.6" height="15.0" fill="rgb(240,161,38)" rx="2" ry="2" />
<text x="338.13" y="447.5" ></text>
</g>
<g >
<title>tick_sched_timer (73,045,894 samples, 0.01%)</title><rect x="87.4" y="101" width="0.1" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="90.39" y="111.5" ></text>
</g>
<g >
<title>update_curr (220,741,522 samples, 0.03%)</title><rect x="593.4" y="197" width="0.3" height="15.0" fill="rgb(227,105,25)" rx="2" ry="2" />
<text x="596.38" y="207.5" ></text>
</g>
<g >
<title>vfs_write (707,254,508 samples, 0.09%)</title><rect x="50.5" y="181" width="1.0" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="53.47" y="191.5" ></text>
</g>
<g >
<title>TransactionIdPrecedes (433,353,918 samples, 0.05%)</title><rect x="932.8" y="501" width="0.6" height="15.0" fill="rgb(226,98,23)" rx="2" ry="2" />
<text x="935.82" y="511.5" ></text>
</g>
<g >
<title>ReleaseBuffer (636,129,423 samples, 0.08%)</title><rect x="605.2" y="533" width="0.9" height="15.0" fill="rgb(220,71,17)" rx="2" ry="2" />
<text x="608.24" y="543.5" ></text>
</g>
<g >
<title>PageGetItemId (485,394,818 samples, 0.06%)</title><rect x="178.4" y="453" width="0.7" height="15.0" fill="rgb(246,192,46)" rx="2" ry="2" />
<text x="181.37" y="463.5" ></text>
</g>
<g >
<title>heap_page_is_all_visible (1,796,979,870 samples, 0.22%)</title><rect x="24.1" y="629" width="2.6" height="15.0" fill="rgb(228,107,25)" rx="2" ry="2" />
<text x="27.11" y="639.5" ></text>
</g>
<g >
<title>pg_rotate_left32 (324,572,656 samples, 0.04%)</title><rect x="614.5" y="325" width="0.5" height="15.0" fill="rgb(205,1,0)" rx="2" ry="2" />
<text x="617.49" y="335.5" ></text>
</g>
<g >
<title>finish_task_switch (147,777,065 samples, 0.02%)</title><rect x="1181.9" y="693" width="0.2" height="15.0" fill="rgb(234,136,32)" rx="2" ry="2" />
<text x="1184.85" y="703.5" ></text>
</g>
<g >
<title>sched_ttwu_pending (1,564,184,133 samples, 0.19%)</title><rect x="1179.1" y="725" width="2.2" height="15.0" fill="rgb(223,85,20)" rx="2" ry="2" />
<text x="1182.08" y="735.5" ></text>
</g>
<g >
<title>LWLockWakeup (86,751,378 samples, 0.01%)</title><rect x="232.6" y="469" width="0.1" height="15.0" fill="rgb(210,24,5)" rx="2" ry="2" />
<text x="235.58" y="479.5" ></text>
</g>
<g >
<title>pagevec_lru_move_fn (1,054,133,464 samples, 0.13%)</title><rect x="439.3" y="165" width="1.5" height="15.0" fill="rgb(205,0,0)" rx="2" ry="2" />
<text x="442.26" y="175.5" ></text>
</g>
<g >
<title>TidStoreMemoryUsage (75,936,328 samples, 0.01%)</title><rect x="52.0" y="421" width="0.1" height="15.0" fill="rgb(229,113,27)" rx="2" ry="2" />
<text x="54.95" y="431.5" ></text>
</g>
<g >
<title>system_call_after_swapgs (109,404,767 samples, 0.01%)</title><rect x="48.0" y="725" width="0.2" height="15.0" fill="rgb(243,179,42)" rx="2" ry="2" />
<text x="51.01" y="735.5" ></text>
</g>
<g >
<title>enqueue_task_fair (498,688,020 samples, 0.06%)</title><rect x="1180.0" y="677" width="0.7" height="15.0" fill="rgb(216,52,12)" rx="2" ry="2" />
<text x="1182.96" y="687.5" ></text>
</g>
<g >
<title>__hrtimer_run_queues (150,595,284 samples, 0.02%)</title><rect x="709.2" y="453" width="0.2" height="15.0" fill="rgb(237,150,35)" rx="2" ry="2" />
<text x="712.23" y="463.5" ></text>
</g>
<g >
<title>UnpinBuffer (544,244,404 samples, 0.07%)</title><rect x="605.4" y="517" width="0.7" height="15.0" fill="rgb(252,219,52)" rx="2" ry="2" />
<text x="608.37" y="527.5" ></text>
</g>
<g >
<title>nohz_balance_enter_idle (246,102,363 samples, 0.03%)</title><rect x="1186.2" y="677" width="0.3" height="15.0" fill="rgb(222,81,19)" rx="2" ry="2" />
<text x="1189.18" y="687.5" ></text>
</g>
<g >
<title>__schedule (1,160,883,236 samples, 0.14%)</title><rect x="1181.3" y="709" width="1.7" height="15.0" fill="rgb(227,103,24)" rx="2" ry="2" />
<text x="1184.32" y="719.5" ></text>
</g>
<g >
<title>proclist_delete_offset (95,612,548 samples, 0.01%)</title><rect x="1141.0" y="469" width="0.1" height="15.0" fill="rgb(221,76,18)" rx="2" ry="2" />
<text x="1143.97" y="479.5" ></text>
</g>
<g >
<title>clear_page_c_e (2,160,487,749 samples, 0.26%)</title><rect x="87.5" y="181" width="3.1" height="15.0" fill="rgb(209,22,5)" rx="2" ry="2" />
<text x="90.50" y="191.5" ></text>
</g>
<g >
[Attachment: CPU flame graph (SVG) of the parallel heap vacuum run. Dominant frames include lazy_scan_prune (44.22% of samples), mdreadv (22.73%), do_sync_read (21.66%), process_pm_pmsignal (15.16%), table_parallel_vacuum_scan (15.08%), StartReadBuffer (7.29%), ReadBufferExtended (6.49%), GetVictimBuffer (5.16%), vfs_read (4.54%), heap_prune_chain (3.16%), SetHintBits (2.92%), and vacuum (2.74%).]
</g>
<g >
<title>sys_futex (323,879,808 samples, 0.04%)</title><rect x="1142.1" y="437" width="0.4" height="15.0" fill="rgb(240,164,39)" rx="2" ry="2" />
<text x="1145.06" y="447.5" ></text>
</g>
<g >
<title>cpuidle_idle_call (321,864,645 samples, 0.04%)</title><rect x="1189.5" y="661" width="0.4" height="15.0" fill="rgb(207,9,2)" rx="2" ry="2" />
<text x="1192.46" y="671.5" ></text>
</g>
<g >
<title>pg_atomic_read_u32_impl (70,813,112 samples, 0.01%)</title><rect x="73.6" y="389" width="0.1" height="15.0" fill="rgb(231,122,29)" rx="2" ry="2" />
<text x="76.61" y="399.5" ></text>
</g>
<g >
<title>PageGetItemId (914,641,063 samples, 0.11%)</title><rect x="164.6" y="501" width="1.3" height="15.0" fill="rgb(246,192,46)" rx="2" ry="2" />
<text x="167.60" y="511.5" ></text>
</g>
<g >
<title>get_hash_entry (852,511,860 samples, 0.10%)</title><rect x="56.9" y="405" width="1.2" height="15.0" fill="rgb(225,93,22)" rx="2" ry="2" />
<text x="59.85" y="415.5" ></text>
</g>
<g >
<title>_cond_resched (110,296,033 samples, 0.01%)</title><rect x="79.0" y="309" width="0.2" height="15.0" fill="rgb(231,121,29)" rx="2" ry="2" />
<text x="82.01" y="319.5" ></text>
</g>
<g >
<title>vfs_read (128,631,091 samples, 0.02%)</title><rect x="51.7" y="277" width="0.2" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="54.71" y="287.5" ></text>
</g>
<g >
<title>tas (129,434,267 samples, 0.02%)</title><rect x="57.9" y="389" width="0.2" height="15.0" fill="rgb(244,182,43)" rx="2" ry="2" />
<text x="60.89" y="399.5" ></text>
</g>
<g >
<title>do_exit (10,517,164,883 samples, 1.28%)</title><rect x="1146.0" y="725" width="15.0" height="15.0" fill="rgb(231,122,29)" rx="2" ry="2" />
<text x="1148.99" y="735.5" ></text>
</g>
<g >
<title>__do_fault.isra.61 (760,965,668 samples, 0.09%)</title><rect x="59.6" y="325" width="1.1" height="15.0" fill="rgb(227,102,24)" rx="2" ry="2" />
<text x="62.63" y="335.5" ></text>
</g>
<g >
<title>alloc_pages_current (1,465,407,686 samples, 0.18%)</title><rect x="386.4" y="213" width="2.1" height="15.0" fill="rgb(216,51,12)" rx="2" ry="2" />
<text x="389.40" y="223.5" ></text>
</g>
<g >
<title>TransactionIdPrecedes (120,317,963 samples, 0.01%)</title><rect x="37.2" y="581" width="0.2" height="15.0" fill="rgb(226,98,23)" rx="2" ry="2" />
<text x="40.20" y="591.5" ></text>
</g>
<g >
<title>__list_del_entry (606,310,164 samples, 0.07%)</title><rect x="573.5" y="85" width="0.8" height="15.0" fill="rgb(214,41,9)" rx="2" ry="2" />
<text x="576.46" y="95.5" ></text>
</g>
<g >
<title>heap_prepare_freeze_tuple (47,580,684,367 samples, 5.77%)</title><rect x="990.9" y="485" width="68.1" height="15.0" fill="rgb(227,101,24)" rx="2" ry="2" />
<text x="993.93" y="495.5" >heap_pr..</text>
</g>
<g >
<title>PageGetItemId (2,476,381,808 samples, 0.30%)</title><rect x="850.3" y="469" width="3.5" height="15.0" fill="rgb(246,192,46)" rx="2" ry="2" />
<text x="853.26" y="479.5" ></text>
</g>
<g >
<title>page_fault (1,158,273,366 samples, 0.14%)</title><rect x="59.2" y="405" width="1.7" height="15.0" fill="rgb(243,177,42)" rx="2" ry="2" />
<text x="62.23" y="415.5" ></text>
</g>
<g >
<title>system_call_after_swapgs (192,189,271 samples, 0.02%)</title><rect x="331.6" y="485" width="0.3" height="15.0" fill="rgb(243,179,42)" rx="2" ry="2" />
<text x="334.64" y="495.5" ></text>
</g>
<g >
<title>BufferGetPage (367,660,041 samples, 0.04%)</title><rect x="226.8" y="437" width="0.5" height="15.0" fill="rgb(253,220,52)" rx="2" ry="2" />
<text x="229.76" y="447.5" ></text>
</g>
<g >
<title>futex_wait (730,129,569 samples, 0.09%)</title><rect x="48.4" y="693" width="1.0" height="15.0" fill="rgb(235,138,33)" rx="2" ry="2" />
<text x="51.40" y="703.5" ></text>
</g>
<g >
<title>tick_check_idle (254,503,972 samples, 0.03%)</title><rect x="1164.1" y="645" width="0.4" height="15.0" fill="rgb(208,17,4)" rx="2" ry="2" />
<text x="1167.15" y="655.5" ></text>
</g>
<g >
<title>LockBuffer (2,663,642,562 samples, 0.32%)</title><rect x="1139.1" y="517" width="3.8" height="15.0" fill="rgb(235,142,34)" rx="2" ry="2" />
<text x="1142.09" y="527.5" ></text>
</g>
<g >
<title>radix_tree_descend (365,643,101 samples, 0.04%)</title><rect x="94.9" y="117" width="0.5" height="15.0" fill="rgb(243,175,41)" rx="2" ry="2" />
<text x="97.92" y="127.5" ></text>
</g>
<g >
<title>LWLockAcquire (1,136,756,674 samples, 0.14%)</title><rect x="616.8" y="389" width="1.6" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="619.76" y="399.5" ></text>
</g>
<g >
<title>queued_spin_lock_slowpath (16,873,002,420 samples, 2.05%)</title><rect x="101.4" y="133" width="24.2" height="15.0" fill="rgb(231,122,29)" rx="2" ry="2" />
<text x="104.40" y="143.5" >q..</text>
</g>
<g >
<title>down_write (956,045,785 samples, 0.12%)</title><rect x="15.8" y="325" width="1.4" height="15.0" fill="rgb(222,79,18)" rx="2" ry="2" />
<text x="18.83" y="335.5" ></text>
</g>
<g >
<title>heap_page_is_all_visible (9,611,264,937 samples, 1.17%)</title><rect x="138.1" y="517" width="13.8" height="15.0" fill="rgb(228,107,25)" rx="2" ry="2" />
<text x="141.11" y="527.5" ></text>
</g>
<g >
<title>hrtimer_interrupt (1,123,942,176 samples, 0.14%)</title><rect x="1165.4" y="645" width="1.7" height="15.0" fill="rgb(228,109,26)" rx="2" ry="2" />
<text x="1168.44" y="655.5" ></text>
</g>
<g >
<title>retint_userspace_restore_args (600,586,378 samples, 0.07%)</title><rect x="246.8" y="405" width="0.9" height="15.0" fill="rgb(215,46,11)" rx="2" ry="2" />
<text x="249.83" y="415.5" ></text>
</g>
<g >
<title>tag_hash (1,038,492,742 samples, 0.13%)</title><rect x="613.8" y="357" width="1.4" height="15.0" fill="rgb(245,185,44)" rx="2" ry="2" />
<text x="616.76" y="367.5" ></text>
</g>
<g >
<title>PinBufferForBlock (1,183,027,082 samples, 0.14%)</title><rect x="134.0" y="405" width="1.7" height="15.0" fill="rgb(241,168,40)" rx="2" ry="2" />
<text x="136.99" y="415.5" ></text>
</g>
<g >
<title>mem_cgroup_cache_charge (1,087,429,015 samples, 0.13%)</title><rect x="97.0" y="165" width="1.6" height="15.0" fill="rgb(251,215,51)" rx="2" ry="2" />
<text x="100.01" y="175.5" ></text>
</g>
<g >
<title>BackendInitialize (3,095,632,095 samples, 0.38%)</title><rect x="50.2" y="677" width="4.4" height="15.0" fill="rgb(239,159,38)" rx="2" ry="2" />
<text x="53.19" y="687.5" ></text>
</g>
<g >
<title>LWLockAttemptLock (146,853,501 samples, 0.02%)</title><rect x="135.0" y="357" width="0.2" height="15.0" fill="rgb(235,138,33)" rx="2" ry="2" />
<text x="138.03" y="367.5" ></text>
</g>
<g >
<title>ServerLoop (128,045,675,684 samples, 15.54%)</title><rect x="50.2" y="725" width="183.3" height="15.0" fill="rgb(238,155,37)" rx="2" ry="2" />
<text x="53.19" y="735.5" >ServerLoop</text>
</g>
<g >
<title>vacuum_rel (3,086,806,387 samples, 0.37%)</title><rect x="50.2" y="517" width="4.4" height="15.0" fill="rgb(219,65,15)" rx="2" ry="2" />
<text x="53.20" y="527.5" ></text>
</g>
<g >
<title>__pread_nocancel (140,093,762 samples, 0.02%)</title><rect x="51.7" y="325" width="0.2" height="15.0" fill="rgb(243,177,42)" rx="2" ry="2" />
<text x="54.69" y="335.5" ></text>
</g>
<g >
<title>get_hash_value (380,506,797 samples, 0.05%)</title><rect x="238.0" y="437" width="0.6" height="15.0" fill="rgb(211,27,6)" rx="2" ry="2" />
<text x="241.02" y="447.5" ></text>
</g>
<g >
<title>__rwsem_mark_wake (137,865,275 samples, 0.02%)</title><rect x="51.1" y="53" width="0.2" height="15.0" fill="rgb(206,8,1)" rx="2" ry="2" />
<text x="54.10" y="63.5" ></text>
</g>
<g >
<title>clear_page_c_e (11,909,543,008 samples, 1.44%)</title><rect x="393.8" y="197" width="17.0" height="15.0" fill="rgb(209,22,5)" rx="2" ry="2" />
<text x="396.79" y="207.5" ></text>
</g>
<g >
<title>heap_page_is_all_visible (51,361,490,837 samples, 6.23%)</title><rect x="636.0" y="533" width="73.6" height="15.0" fill="rgb(228,107,25)" rx="2" ry="2" />
<text x="639.03" y="543.5" >heap_pag..</text>
</g>
<g >
<title>ItemPointerIsValid (235,071,207 samples, 0.03%)</title><rect x="230.2" y="485" width="0.3" height="15.0" fill="rgb(206,7,1)" rx="2" ry="2" />
<text x="233.17" y="495.5" ></text>
</g>
<g >
<title>dequeue_entity (154,556,759 samples, 0.02%)</title><rect x="130.8" y="197" width="0.2" height="15.0" fill="rgb(233,130,31)" rx="2" ry="2" />
<text x="133.76" y="207.5" ></text>
</g>
<g >
<title>do_futex (730,129,569 samples, 0.09%)</title><rect x="48.4" y="709" width="1.0" height="15.0" fill="rgb(245,184,44)" rx="2" ry="2" />
<text x="51.40" y="719.5" ></text>
</g>
<g >
<title>radix_tree_descend (88,773,751 samples, 0.01%)</title><rect x="256.8" y="229" width="0.1" height="15.0" fill="rgb(243,175,41)" rx="2" ry="2" />
<text x="259.80" y="239.5" ></text>
</g>
<g >
<title>tick_nohz_idle_exit (1,244,621,438 samples, 0.15%)</title><rect x="1186.8" y="725" width="1.8" height="15.0" fill="rgb(246,192,46)" rx="2" ry="2" />
<text x="1189.77" y="735.5" ></text>
</g>
<g >
<title>native_queued_spin_lock_slowpath (1,521,504,898 samples, 0.18%)</title><rect x="1152.2" y="549" width="2.2" height="15.0" fill="rgb(238,153,36)" rx="2" ry="2" />
<text x="1155.22" y="559.5" ></text>
</g>
<g >
<title>tick_sched_handle (105,887,461 samples, 0.01%)</title><rect x="203.0" y="357" width="0.2" height="15.0" fill="rgb(219,68,16)" rx="2" ry="2" />
<text x="206.01" y="367.5" ></text>
</g>
<g >
<title>LWLockAcquire (174,718,745 samples, 0.02%)</title><rect x="135.0" y="373" width="0.3" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="138.00" y="383.5" ></text>
</g>
<g >
<title>StrategyGetBuffer (8,227,865,316 samples, 1.00%)</title><rect x="61.7" y="421" width="11.7" height="15.0" fill="rgb(216,54,13)" rx="2" ry="2" />
<text x="64.66" y="431.5" ></text>
</g>
<g >
<title>BlockIdSet (793,078,598 samples, 0.10%)</title><rect x="758.2" y="517" width="1.1" height="15.0" fill="rgb(236,143,34)" rx="2" ry="2" />
<text x="761.16" y="527.5" ></text>
</g>
<g >
<title>tick_sched_handle (74,180,775 samples, 0.01%)</title><rect x="621.3" y="373" width="0.1" height="15.0" fill="rgb(219,68,16)" rx="2" ry="2" />
<text x="624.34" y="383.5" ></text>
</g>
<g >
<title>up_read (79,012,962 samples, 0.01%)</title><rect x="585.5" y="261" width="0.2" height="15.0" fill="rgb(209,18,4)" rx="2" ry="2" />
<text x="588.55" y="271.5" ></text>
</g>
<g >
<title>heap_tuple_should_freeze (99,586,058 samples, 0.01%)</title><rect x="39.8" y="581" width="0.1" height="15.0" fill="rgb(247,194,46)" rx="2" ry="2" />
<text x="42.79" y="591.5" ></text>
</g>
<g >
<title>xfs_file_buffered_aio_write (300,277,888 samples, 0.04%)</title><rect x="10.4" y="565" width="0.5" height="15.0" fill="rgb(243,176,42)" rx="2" ry="2" />
<text x="13.43" y="575.5" ></text>
</g>
<g >
<title>PinBufferForBlock (5,817,463,213 samples, 0.71%)</title><rect x="612.8" y="421" width="8.3" height="15.0" fill="rgb(241,168,40)" rx="2" ry="2" />
<text x="615.82" y="431.5" ></text>
</g>
<g >
<title>_cond_resched (76,797,490 samples, 0.01%)</title><rect x="343.4" y="325" width="0.1" height="15.0" fill="rgb(231,121,29)" rx="2" ry="2" />
<text x="346.40" y="335.5" ></text>
</g>
<g >
<title>page_fault (177,844,528 samples, 0.02%)</title><rect x="74.1" y="405" width="0.3" height="15.0" fill="rgb(243,177,42)" rx="2" ry="2" />
<text x="77.12" y="415.5" ></text>
</g>
<g >
<title>PageGetItemId (2,397,128,644 samples, 0.29%)</title><rect x="788.6" y="501" width="3.5" height="15.0" fill="rgb(246,192,46)" rx="2" ry="2" />
<text x="791.63" y="511.5" ></text>
</g>
<g >
<title>unlock_page (100,983,691 samples, 0.01%)</title><rect x="49.3" y="645" width="0.1" height="15.0" fill="rgb(220,69,16)" rx="2" ry="2" />
<text x="52.29" y="655.5" ></text>
</g>
<g >
<title>tick_sched_handle (144,074,202 samples, 0.02%)</title><rect x="828.8" y="389" width="0.3" height="15.0" fill="rgb(219,68,16)" rx="2" ry="2" />
<text x="831.85" y="399.5" ></text>
</g>
<g >
<title>PageGetItemId (141,210,959 samples, 0.02%)</title><rect x="29.5" y="613" width="0.2" height="15.0" fill="rgb(246,192,46)" rx="2" ry="2" />
<text x="32.53" y="623.5" ></text>
</g>
<g >
<title>find_busiest_group (73,446,886 samples, 0.01%)</title><rect x="1189.8" y="485" width="0.1" height="15.0" fill="rgb(239,158,37)" rx="2" ry="2" />
<text x="1192.76" y="495.5" ></text>
</g>
<g >
<title>tick_sched_timer (279,325,368 samples, 0.03%)</title><rect x="445.1" y="101" width="0.4" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="448.13" y="111.5" ></text>
</g>
<g >
<title>GetPrivateRefCount (364,696,395 samples, 0.04%)</title><rect x="1102.1" y="469" width="0.5" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="1105.06" y="479.5" ></text>
</g>
<g >
<title>free_pages_and_swap_cache (1,475,718,560 samples, 0.18%)</title><rect x="1158.6" y="613" width="2.1" height="15.0" fill="rgb(222,82,19)" rx="2" ry="2" />
<text x="1161.63" y="623.5" ></text>
</g>
<g >
<title>iomap_write_end (374,091,118 samples, 0.05%)</title><rect x="14.0" y="293" width="0.5" height="15.0" fill="rgb(242,171,41)" rx="2" ry="2" />
<text x="17.00" y="303.5" ></text>
</g>
<g >
<title>do_read_fault.isra.63 (1,097,689,706 samples, 0.13%)</title><rect x="255.5" y="357" width="1.6" height="15.0" fill="rgb(216,52,12)" rx="2" ry="2" />
<text x="258.51" y="367.5" ></text>
</g>
<g >
<title>pg_atomic_fetch_or_u32 (213,806,923 samples, 0.03%)</title><rect x="329.7" y="469" width="0.3" height="15.0" fill="rgb(221,74,17)" rx="2" ry="2" />
<text x="332.73" y="479.5" ></text>
</g>
<g >
<title>PinBuffer (132,780,443 samples, 0.02%)</title><rect x="135.3" y="373" width="0.2" height="15.0" fill="rgb(219,64,15)" rx="2" ry="2" />
<text x="138.34" y="383.5" ></text>
</g>
<g >
<title>BufferIsValid (199,830,704 samples, 0.02%)</title><rect x="227.9" y="421" width="0.3" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="230.89" y="431.5" ></text>
</g>
<g >
<title>GetPrivateRefCountEntry (94,510,130 samples, 0.01%)</title><rect x="258.1" y="421" width="0.1" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="261.07" y="431.5" ></text>
</g>
<g >
<title>LWLockQueueSelf (69,869,533 samples, 0.01%)</title><rect x="320.4" y="437" width="0.1" height="15.0" fill="rgb(236,146,35)" rx="2" ry="2" />
<text x="323.36" y="447.5" ></text>
</g>
<g >
<title>__mem_cgroup_commit_charge (3,671,260,686 samples, 0.45%)</title><rect x="446.3" y="149" width="5.3" height="15.0" fill="rgb(212,32,7)" rx="2" ry="2" />
<text x="449.33" y="159.5" ></text>
</g>
<g >
<title>PinBufferForBlock (986,844,257 samples, 0.12%)</title><rect x="50.2" y="357" width="1.4" height="15.0" fill="rgb(241,168,40)" rx="2" ry="2" />
<text x="53.22" y="367.5" ></text>
</g>
<g >
<title>spin_delay (136,825,160 samples, 0.02%)</title><rect x="72.0" y="373" width="0.2" height="15.0" fill="rgb(240,162,38)" rx="2" ry="2" />
<text x="75.00" y="383.5" ></text>
</g>
<g >
<title>WaitReadBuffers (1,367,408,349 samples, 0.17%)</title><rect x="20.5" y="613" width="1.9" height="15.0" fill="rgb(210,26,6)" rx="2" ry="2" />
<text x="23.48" y="623.5" ></text>
</g>
<g >
<title>do_writepages (115,471,936 samples, 0.01%)</title><rect x="10.1" y="629" width="0.2" height="15.0" fill="rgb(240,164,39)" rx="2" ry="2" />
<text x="13.10" y="639.5" ></text>
</g>
<g >
<title>sys_write (314,355,073 samples, 0.04%)</title><rect x="10.4" y="629" width="0.5" height="15.0" fill="rgb(247,194,46)" rx="2" ry="2" />
<text x="13.42" y="639.5" ></text>
</g>
<g >
<title>rcu_idle_enter (103,348,351 samples, 0.01%)</title><rect x="1178.8" y="725" width="0.2" height="15.0" fill="rgb(228,109,26)" rx="2" ry="2" />
<text x="1181.83" y="735.5" ></text>
</g>
<g >
<title>pg_atomic_fetch_add_u32_impl (778,073,501 samples, 0.09%)</title><rect x="275.2" y="405" width="1.1" height="15.0" fill="rgb(238,155,37)" rx="2" ry="2" />
<text x="278.18" y="415.5" ></text>
</g>
<g >
<title>HeapTupleSatisfiesVacuumHorizon (6,968,404,220 samples, 0.85%)</title><rect x="220.2" y="485" width="10.0" height="15.0" fill="rgb(207,13,3)" rx="2" ry="2" />
<text x="223.19" y="495.5" ></text>
</g>
<g >
<title>pgstat_count_io_op_time (87,632,648 samples, 0.01%)</title><rect x="76.6" y="485" width="0.1" height="15.0" fill="rgb(209,19,4)" rx="2" ry="2" />
<text x="79.57" y="495.5" ></text>
</g>
<g >
<title>FileReadV (38,665,271,064 samples, 4.69%)</title><rect x="76.9" y="453" width="55.3" height="15.0" fill="rgb(222,81,19)" rx="2" ry="2" />
<text x="79.87" y="463.5" >FileR..</text>
</g>
<g >
<title>get_hash_entry (182,309,876 samples, 0.02%)</title><rect x="11.2" y="517" width="0.3" height="15.0" fill="rgb(225,93,22)" rx="2" ry="2" />
<text x="14.24" y="527.5" ></text>
</g>
<g >
<title>pgstat_count_io_op_time (471,510,682 samples, 0.06%)</title><rect x="332.1" y="501" width="0.6" height="15.0" fill="rgb(209,19,4)" rx="2" ry="2" />
<text x="335.06" y="511.5" ></text>
</g>
<g >
<title>smgropen (85,656,911 samples, 0.01%)</title><rect x="12.1" y="517" width="0.2" height="15.0" fill="rgb(211,28,6)" rx="2" ry="2" />
<text x="15.13" y="527.5" ></text>
</g>
<g >
<title>MarkBufferDirtyHint (539,831,467 samples, 0.07%)</title><rect x="41.7" y="565" width="0.8" height="15.0" fill="rgb(234,136,32)" rx="2" ry="2" />
<text x="44.69" y="575.5" ></text>
</g>
<g >
<title>__audit_syscall_exit (144,066,144 samples, 0.02%)</title><rect x="77.6" y="405" width="0.2" height="15.0" fill="rgb(218,62,14)" rx="2" ry="2" />
<text x="80.60" y="415.5" ></text>
</g>
<g >
<title>__schedule (396,109,392 samples, 0.05%)</title><rect x="310.8" y="293" width="0.6" height="15.0" fill="rgb(227,103,24)" rx="2" ry="2" />
<text x="313.79" y="303.5" ></text>
</g>
<g >
<title>system_call_fastpath (704,492,392 samples, 0.09%)</title><rect x="310.5" y="373" width="1.0" height="15.0" fill="rgb(252,217,52)" rx="2" ry="2" />
<text x="313.46" y="383.5" ></text>
</g>
<g >
<title>smgrwrite (5,080,836,104 samples, 0.62%)</title><rect x="12.3" y="517" width="7.2" height="15.0" fill="rgb(229,112,26)" rx="2" ry="2" />
<text x="15.25" y="527.5" ></text>
</g>
<g >
<title>BufTableHashCode (393,071,281 samples, 0.05%)</title><rect x="238.0" y="453" width="0.6" height="15.0" fill="rgb(215,47,11)" rx="2" ry="2" />
<text x="241.01" y="463.5" ></text>
</g>
<g >
<title>sys_futex (73,620,781 samples, 0.01%)</title><rect x="232.6" y="421" width="0.1" height="15.0" fill="rgb(240,164,39)" rx="2" ry="2" />
<text x="235.59" y="431.5" ></text>
</g>
<g >
<title>hash_search_with_hash_value (93,560,043 samples, 0.01%)</title><rect x="134.8" y="357" width="0.1" height="15.0" fill="rgb(249,205,49)" rx="2" ry="2" />
<text x="137.80" y="367.5" ></text>
</g>
<g >
<title>dequeue_task_fair (529,132,857 samples, 0.06%)</title><rect x="593.0" y="229" width="0.7" height="15.0" fill="rgb(230,119,28)" rx="2" ry="2" />
<text x="595.97" y="239.5" ></text>
</g>
<g >
<title>deactivate_task (699,485,140 samples, 0.08%)</title><rect x="592.8" y="245" width="1.0" height="15.0" fill="rgb(206,8,2)" rx="2" ry="2" />
<text x="595.82" y="255.5" ></text>
</g>
<g >
<title>activate_task (862,742,985 samples, 0.10%)</title><rect x="1179.5" y="693" width="1.2" height="15.0" fill="rgb(205,1,0)" rx="2" ry="2" />
<text x="1182.48" y="703.5" ></text>
</g>
<g >
<title>handle_mm_fault (1,155,822,417 samples, 0.14%)</title><rect x="327.4" y="405" width="1.7" height="15.0" fill="rgb(234,135,32)" rx="2" ry="2" />
<text x="330.40" y="415.5" ></text>
</g>
<g >
<title>ItemPointerIsValid (328,053,928 samples, 0.04%)</title><rect x="145.9" y="485" width="0.5" height="15.0" fill="rgb(206,7,1)" rx="2" ry="2" />
<text x="148.93" y="495.5" ></text>
</g>
<g >
<title>MarkBufferDirty (536,993,730 samples, 0.07%)</title><rect x="1142.9" y="517" width="0.8" height="15.0" fill="rgb(238,152,36)" rx="2" ry="2" />
<text x="1145.91" y="527.5" ></text>
</g>
<g >
<title>hash_search_with_hash_value (89,443,719 samples, 0.01%)</title><rect x="43.4" y="693" width="0.1" height="15.0" fill="rgb(249,205,49)" rx="2" ry="2" />
<text x="46.36" y="703.5" ></text>
</g>
<g >
<title>ttwu_do_activate (1,314,857,365 samples, 0.16%)</title><rect x="1179.4" y="709" width="1.9" height="15.0" fill="rgb(215,48,11)" rx="2" ry="2" />
<text x="1182.44" y="719.5" ></text>
</g>
<g >
<title>try_to_wake_up (660,719,524 samples, 0.08%)</title><rect x="1165.7" y="581" width="1.0" height="15.0" fill="rgb(220,70,16)" rx="2" ry="2" />
<text x="1168.71" y="591.5" ></text>
</g>
<g >
<title>StartReadBuffer (6,579,262,292 samples, 0.80%)</title><rect x="611.8" y="453" width="9.5" height="15.0" fill="rgb(222,78,18)" rx="2" ry="2" />
<text x="614.83" y="463.5" ></text>
</g>
<g >
<title>do_page_fault (1,157,242,416 samples, 0.14%)</title><rect x="59.2" y="389" width="1.7" height="15.0" fill="rgb(216,54,13)" rx="2" ry="2" />
<text x="62.23" y="399.5" ></text>
</g>
<g >
<title>start_kernel (379,723,816 samples, 0.05%)</title><rect x="1189.5" y="725" width="0.5" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="1192.46" y="735.5" ></text>
</g>
<g >
<title>page_fault (149,153,652,515 samples, 18.10%)</title><rect x="372.3" y="309" width="213.5" height="15.0" fill="rgb(243,177,42)" rx="2" ry="2" />
<text x="375.31" y="319.5" >page_fault</text>
</g>
<g >
<title>generic_segment_checks (75,732,677 samples, 0.01%)</title><rect x="587.0" y="341" width="0.1" height="15.0" fill="rgb(236,144,34)" rx="2" ry="2" />
<text x="590.01" y="351.5" ></text>
</g>
<g >
<title>GetBufferDescriptor (334,944,306 samples, 0.04%)</title><rect x="1093.2" y="453" width="0.5" height="15.0" fill="rgb(249,202,48)" rx="2" ry="2" />
<text x="1096.23" y="463.5" ></text>
</g>
<g >
<title>StrategyGetBuffer (41,781,460,625 samples, 5.07%)</title><rect x="258.5" y="437" width="59.8" height="15.0" fill="rgb(216,54,13)" rx="2" ry="2" />
<text x="261.48" y="447.5" >Strate..</text>
</g>
<g >
<title>wake_up_q (372,517,381 samples, 0.05%)</title><rect x="594.6" y="277" width="0.5" height="15.0" fill="rgb(237,151,36)" rx="2" ry="2" />
<text x="597.61" y="287.5" ></text>
</g>
<g >
<title>sem_post@@GLIBC_2.2.5 (109,531,290 samples, 0.01%)</title><rect x="73.9" y="405" width="0.2" height="15.0" fill="rgb(214,41,9)" rx="2" ry="2" />
<text x="76.93" y="415.5" ></text>
</g>
<g >
<title>quiet_vmstat (479,485,129 samples, 0.06%)</title><rect x="1188.7" y="741" width="0.6" height="15.0" fill="rgb(249,204,48)" rx="2" ry="2" />
<text x="1191.66" y="751.5" ></text>
</g>
<g >
<title>shared_ts_memory_usage (326,223,408 samples, 0.04%)</title><rect x="132.4" y="517" width="0.5" height="15.0" fill="rgb(215,46,11)" rx="2" ry="2" />
<text x="135.41" y="527.5" ></text>
</g>
<g >
<title>xfs_file_aio_read (123,055,875 samples, 0.01%)</title><rect x="51.7" y="245" width="0.2" height="15.0" fill="rgb(224,90,21)" rx="2" ry="2" />
<text x="54.71" y="255.5" ></text>
</g>
<g >
<title>page_fault (1,241,456,530 samples, 0.15%)</title><rect x="327.3" y="453" width="1.8" height="15.0" fill="rgb(243,177,42)" rx="2" ry="2" />
<text x="330.31" y="463.5" ></text>
</g>
<g >
<title>shmem_getpage_gfp (116,623,553,586 samples, 14.15%)</title><rect x="410.9" y="197" width="167.0" height="15.0" fill="rgb(227,105,25)" rx="2" ry="2" />
<text x="413.95" y="207.5" >shmem_getpage_gfp</text>
</g>
<g >
<title>pg_atomic_read_u32_impl (73,640,039 samples, 0.01%)</title><rect x="20.2" y="501" width="0.2" height="15.0" fill="rgb(231,122,29)" rx="2" ry="2" />
<text x="23.25" y="511.5" ></text>
</g>
<g >
<title>postgres (803,306,357,975 samples, 97.46%)</title><rect x="11.0" y="789" width="1150.0" height="15.0" fill="rgb(233,131,31)" rx="2" ry="2" />
<text x="13.98" y="799.5" >postgres</text>
</g>
<g >
<title>vm_readbuf (95,772,990 samples, 0.01%)</title><rect x="1145.3" y="533" width="0.1" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="1148.27" y="543.5" ></text>
</g>
<g >
<title>radix_tree_lookup_slot (457,486,155 samples, 0.06%)</title><rect x="60.0" y="245" width="0.7" height="15.0" fill="rgb(210,23,5)" rx="2" ry="2" />
<text x="63.05" y="255.5" ></text>
</g>
<g >
<title>TidStoreMemoryUsage (341,841,353 samples, 0.04%)</title><rect x="132.4" y="533" width="0.5" height="15.0" fill="rgb(229,113,27)" rx="2" ry="2" />
<text x="135.39" y="543.5" ></text>
</g>
<g >
<title>LWLockAcquire (83,031,264 samples, 0.01%)</title><rect x="20.2" y="549" width="0.2" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="23.23" y="559.5" ></text>
</g>
<g >
<title>__inc_zone_state (88,589,607 samples, 0.01%)</title><rect x="126.2" y="101" width="0.1" height="15.0" fill="rgb(241,168,40)" rx="2" ry="2" />
<text x="129.18" y="111.5" ></text>
</g>
<g >
<title>ReadBufferExtended (254,631,147,709 samples, 30.89%)</title><rect x="236.8" y="549" width="364.5" height="15.0" fill="rgb(242,171,40)" rx="2" ry="2" />
<text x="239.80" y="559.5" >ReadBufferExtended</text>
</g>
<g >
<title>pick_next_task_rt (76,627,109 samples, 0.01%)</title><rect x="1182.7" y="693" width="0.1" height="15.0" fill="rgb(206,7,1)" rx="2" ry="2" />
<text x="1185.71" y="703.5" ></text>
</g>
<g >
<title>heap_page_prune_and_freeze (55,344,640,583 samples, 6.71%)</title><rect x="151.9" y="517" width="79.2" height="15.0" fill="rgb(213,40,9)" rx="2" ry="2" />
<text x="154.87" y="527.5" >heap_page..</text>
</g>
<g >
<title>tick_sched_handle (100,961,491 samples, 0.01%)</title><rect x="1034.0" y="373" width="0.1" height="15.0" fill="rgb(219,68,16)" rx="2" ry="2" />
<text x="1036.96" y="383.5" ></text>
</g>
<g >
<title>TransactionIdGetCommitLSN (365,727,757 samples, 0.04%)</title><rect x="1123.1" y="485" width="0.5" height="15.0" fill="rgb(238,152,36)" rx="2" ry="2" />
<text x="1126.08" y="495.5" ></text>
</g>
<g >
<title>futex_wait_queue_me (110,500,112 samples, 0.01%)</title><rect x="48.6" y="677" width="0.1" height="15.0" fill="rgb(254,228,54)" rx="2" ry="2" />
<text x="51.57" y="687.5" ></text>
</g>
<g >
<title>BackgroundWorkerMain (637,248,213,532 samples, 77.32%)</title><rect x="233.7" y="661" width="912.3" height="15.0" fill="rgb(240,161,38)" rx="2" ry="2" />
<text x="236.66" y="671.5" >BackgroundWorkerMain</text>
</g>
<g >
<title>BufTableInsert (1,361,390,705 samples, 0.17%)</title><rect x="56.3" y="437" width="2.0" height="15.0" fill="rgb(206,8,1)" rx="2" ry="2" />
<text x="59.34" y="447.5" ></text>
</g>
<g >
<title>get_hash_value (178,509,464 samples, 0.02%)</title><rect x="56.1" y="421" width="0.2" height="15.0" fill="rgb(211,27,6)" rx="2" ry="2" />
<text x="59.09" y="431.5" ></text>
</g>
<g >
<title>rwsem_down_read_failed (3,012,432,972 samples, 0.37%)</title><rect x="590.8" y="293" width="4.3" height="15.0" fill="rgb(254,225,54)" rx="2" ry="2" />
<text x="593.83" y="303.5" ></text>
</g>
<g >
<title>schedule (1,619,490,228 samples, 0.20%)</title><rect x="592.3" y="277" width="2.3" height="15.0" fill="rgb(254,229,54)" rx="2" ry="2" />
<text x="595.29" y="287.5" ></text>
</g>
<g >
<title>native_queued_spin_lock_slowpath (72,545,272,590 samples, 8.80%)</title><rect x="465.7" y="133" width="103.9" height="15.0" fill="rgb(238,153,36)" rx="2" ry="2" />
<text x="468.73" y="143.5" >native_queue..</text>
</g>
<g >
<title>heap_prune_satisfies_vacuum (45,735,515,921 samples, 5.55%)</title><rect x="1064.3" y="517" width="65.4" height="15.0" fill="rgb(252,219,52)" rx="2" ry="2" />
<text x="1067.26" y="527.5" >heap_pr..</text>
</g>
<g >
<title>sysret_audit (229,738,891 samples, 0.03%)</title><rect x="77.5" y="421" width="0.3" height="15.0" fill="rgb(238,152,36)" rx="2" ry="2" />
<text x="80.50" y="431.5" ></text>
</g>
<g >
<title>GetPrivateRefCountEntry (71,568,769 samples, 0.01%)</title><rect x="625.4" y="501" width="0.1" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="628.42" y="511.5" ></text>
</g>
<g >
<title>TerminateBufferIO (829,824,700 samples, 0.10%)</title><rect x="75.0" y="485" width="1.1" height="15.0" fill="rgb(239,160,38)" rx="2" ry="2" />
<text x="77.96" y="495.5" ></text>
</g>
<g >
<title>__xfs_trans_commit (112,688,247 samples, 0.01%)</title><rect x="586.7" y="245" width="0.1" height="15.0" fill="rgb(223,85,20)" rx="2" ry="2" />
<text x="589.66" y="255.5" ></text>
</g>
<g >
<title>LockBufHdr (107,984,003 samples, 0.01%)</title><rect x="236.4" y="533" width="0.1" height="15.0" fill="rgb(236,143,34)" rx="2" ry="2" />
<text x="239.38" y="543.5" ></text>
</g>
<g >
<title>TransactionIdPrecedes (491,740,979 samples, 0.06%)</title><rect x="151.1" y="501" width="0.7" height="15.0" fill="rgb(226,98,23)" rx="2" ry="2" />
<text x="154.15" y="511.5" ></text>
</g>
<g >
<title>wake_up_q (128,474,798 samples, 0.02%)</title><rect x="51.3" y="53" width="0.2" height="15.0" fill="rgb(237,151,36)" rx="2" ry="2" />
<text x="54.30" y="63.5" ></text>
</g>
<g >
<title>_raw_spin_lock_irqsave (1,537,044,185 samples, 0.19%)</title><rect x="1152.2" y="581" width="2.2" height="15.0" fill="rgb(247,195,46)" rx="2" ry="2" />
<text x="1155.20" y="591.5" ></text>
</g>
<g >
<title>GetBufferDescriptor (290,238,862 samples, 0.04%)</title><rect x="1111.4" y="453" width="0.4" height="15.0" fill="rgb(249,202,48)" rx="2" ry="2" />
<text x="1114.41" y="463.5" ></text>
</g>
<g >
<title>get_page_from_freelist (829,319,855 samples, 0.10%)</title><rect x="125.9" y="117" width="1.2" height="15.0" fill="rgb(252,218,52)" rx="2" ry="2" />
<text x="128.91" y="127.5" ></text>
</g>
<g >
<title>BufferIsPermanent (297,412,599 samples, 0.04%)</title><rect x="1083.9" y="485" width="0.4" height="15.0" fill="rgb(250,210,50)" rx="2" ry="2" />
<text x="1086.91" y="495.5" ></text>
</g>
<g >
<title>do_sync_read (96,959,467 samples, 0.01%)</title><rect x="340.7" y="405" width="0.2" height="15.0" fill="rgb(237,147,35)" rx="2" ry="2" />
<text x="343.73" y="415.5" ></text>
</g>
<g >
<title>hash_initial_lookup (122,832,498 samples, 0.01%)</title><rect x="58.1" y="405" width="0.1" height="15.0" fill="rgb(251,214,51)" rx="2" ry="2" />
<text x="61.07" y="415.5" ></text>
</g>
<g >
<title>activate_task (298,546,138 samples, 0.04%)</title><rect x="1166.1" y="549" width="0.4" height="15.0" fill="rgb(205,1,0)" rx="2" ry="2" />
<text x="1169.12" y="559.5" ></text>
</g>
<g >
<title>enqueue_task_fair (212,481,452 samples, 0.03%)</title><rect x="1166.2" y="533" width="0.3" height="15.0" fill="rgb(216,52,12)" rx="2" ry="2" />
<text x="1169.22" y="543.5" ></text>
</g>
<g >
<title>StrategyGetBuffer (400,166,752 samples, 0.05%)</title><rect x="47.0" y="741" width="0.5" height="15.0" fill="rgb(216,54,13)" rx="2" ry="2" />
<text x="49.95" y="751.5" ></text>
</g>
<g >
<title>pgstat_progress_update_param (94,577,348 samples, 0.01%)</title><rect x="232.9" y="533" width="0.1" height="15.0" fill="rgb(227,103,24)" rx="2" ry="2" />
<text x="235.88" y="543.5" ></text>
</g>
<g >
<title>hrtimer_nanosleep (609,658,589 samples, 0.07%)</title><rect x="310.6" y="341" width="0.9" height="15.0" fill="rgb(237,147,35)" rx="2" ry="2" />
<text x="313.59" y="351.5" ></text>
</g>
<g >
<title>LWLockAttemptLock (328,016,233 samples, 0.04%)</title><rect x="235.3" y="501" width="0.5" height="15.0" fill="rgb(235,138,33)" rx="2" ry="2" />
<text x="238.31" y="511.5" ></text>
</g>
<g >
<title>hrtimer_interrupt (134,744,497 samples, 0.02%)</title><rect x="853.8" y="421" width="0.2" height="15.0" fill="rgb(228,109,26)" rx="2" ry="2" />
<text x="856.81" y="431.5" ></text>
</g>
<g >
<title>BufferGetBlockNumber (127,765,126 samples, 0.02%)</title><rect x="42.9" y="613" width="0.2" height="15.0" fill="rgb(206,7,1)" rx="2" ry="2" />
<text x="45.91" y="623.5" ></text>
</g>
<g >
<title>__find_get_page (1,512,560,062 samples, 0.18%)</title><rect x="269.2" y="277" width="2.2" height="15.0" fill="rgb(229,114,27)" rx="2" ry="2" />
<text x="272.23" y="287.5" ></text>
</g>
<g >
<title>__pagevec_lru_add_fn (549,387,897 samples, 0.07%)</title><rect x="439.4" y="149" width="0.7" height="15.0" fill="rgb(244,183,43)" rx="2" ry="2" />
<text x="442.36" y="159.5" ></text>
</g>
<g >
<title>xfs_trans_commit (404,505,494 samples, 0.05%)</title><rect x="15.1" y="277" width="0.5" height="15.0" fill="rgb(250,210,50)" rx="2" ry="2" />
<text x="18.06" y="287.5" ></text>
</g>
<g >
<title>FileReadV (1,107,321,897 samples, 0.13%)</title><rect x="20.8" y="565" width="1.6" height="15.0" fill="rgb(222,81,19)" rx="2" ry="2" />
<text x="23.84" y="575.5" ></text>
</g>
<g >
<title>pg_atomic_read_u32 (867,878,838 samples, 0.11%)</title><rect x="1120.8" y="469" width="1.3" height="15.0" fill="rgb(248,202,48)" rx="2" ry="2" />
<text x="1123.85" y="479.5" ></text>
</g>
<g >
<title>arch_cpu_idle (321,864,645 samples, 0.04%)</title><rect x="1189.5" y="677" width="0.4" height="15.0" fill="rgb(218,62,14)" rx="2" ry="2" />
<text x="1192.46" y="687.5" ></text>
</g>
<g >
<title>radix_tree_lookup_slot (135,504,497 samples, 0.02%)</title><rect x="64.1" y="245" width="0.2" height="15.0" fill="rgb(210,23,5)" rx="2" ry="2" />
<text x="67.12" y="255.5" ></text>
</g>
<g >
<title>__do_fault.isra.61 (1,016,725,410 samples, 0.12%)</title><rect x="255.5" y="341" width="1.5" height="15.0" fill="rgb(227,102,24)" rx="2" ry="2" />
<text x="258.54" y="351.5" ></text>
</g>
<g >
<title>tick_sched_timer (141,961,854 samples, 0.02%)</title><rect x="709.2" y="437" width="0.2" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="712.24" y="447.5" ></text>
</g>
<g >
<title>queued_spin_lock_slowpath (107,024,501 samples, 0.01%)</title><rect x="130.2" y="245" width="0.2" height="15.0" fill="rgb(231,122,29)" rx="2" ry="2" />
<text x="133.20" y="255.5" ></text>
</g>
<g >
<title>GetPrivateRefCountEntry (71,259,274 samples, 0.01%)</title><rect x="1143.0" y="485" width="0.1" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="1146.04" y="495.5" ></text>
</g>
<g >
<title>TransactionIdPrecedes (136,932,215 samples, 0.02%)</title><rect x="1125.3" y="485" width="0.2" height="15.0" fill="rgb(226,98,23)" rx="2" ry="2" />
<text x="1128.32" y="495.5" ></text>
</g>
<g >
<title>__pread_nocancel (989,449,507 samples, 0.12%)</title><rect x="21.0" y="549" width="1.4" height="15.0" fill="rgb(243,177,42)" rx="2" ry="2" />
<text x="24.01" y="559.5" ></text>
</g>
<g >
<title>scheduler_tick (207,063,124 samples, 0.03%)</title><rect x="445.2" y="53" width="0.3" height="15.0" fill="rgb(246,190,45)" rx="2" ry="2" />
<text x="448.23" y="63.5" ></text>
</g>
<g >
<title>iomap_apply (1,349,795,106 samples, 0.16%)</title><rect x="12.7" y="325" width="2.0" height="15.0" fill="rgb(247,194,46)" rx="2" ry="2" />
<text x="15.74" y="335.5" ></text>
</g>
<g >
<title>PageGetItemId (1,217,995,004 samples, 0.15%)</title><rect x="632.9" y="533" width="1.7" height="15.0" fill="rgb(246,192,46)" rx="2" ry="2" />
<text x="635.85" y="543.5" ></text>
</g>
<g >
<title>__do_page_fault (1,204,569,436 samples, 0.15%)</title><rect x="327.3" y="421" width="1.8" height="15.0" fill="rgb(239,158,37)" rx="2" ry="2" />
<text x="330.33" y="431.5" ></text>
</g>
<g >
<title>BufferIsValid (612,407,810 samples, 0.07%)</title><rect x="1094.4" y="437" width="0.9" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="1097.42" y="447.5" ></text>
</g>
<g >
<title>LockBufHdr (2,046,264,994 samples, 0.25%)</title><rect x="263.7" y="421" width="2.9" height="15.0" fill="rgb(236,143,34)" rx="2" ry="2" />
<text x="266.66" y="431.5" ></text>
</g>
<g >
<title>ItemPointerIsValid (1,938,716,806 samples, 0.24%)</title><rect x="1084.3" y="485" width="2.8" height="15.0" fill="rgb(206,7,1)" rx="2" ry="2" />
<text x="1087.33" y="495.5" ></text>
</g>
<g >
<title>__set_page_dirty_no_writeback (139,953,307 samples, 0.02%)</title><rect x="577.9" y="229" width="0.2" height="15.0" fill="rgb(223,86,20)" rx="2" ry="2" />
<text x="580.93" y="239.5" ></text>
</g>
<g >
<title>mdreadv (1,124,397,244 samples, 0.14%)</title><rect x="20.8" y="581" width="1.6" height="15.0" fill="rgb(239,159,38)" rx="2" ry="2" />
<text x="23.83" y="591.5" ></text>
</g>
<g >
<title>smp_apic_timer_interrupt (105,887,461 samples, 0.01%)</title><rect x="203.0" y="437" width="0.2" height="15.0" fill="rgb(221,74,17)" rx="2" ry="2" />
<text x="206.01" y="447.5" ></text>
</g>
<g >
<title>hash_search_with_hash_value (281,396,594 samples, 0.03%)</title><rect x="11.2" y="533" width="0.4" height="15.0" fill="rgb(249,205,49)" rx="2" ry="2" />
<text x="14.20" y="543.5" ></text>
</g>
<g >
<title>StrategyGetBuffer (308,807,514 samples, 0.04%)</title><rect x="43.5" y="677" width="0.4" height="15.0" fill="rgb(216,54,13)" rx="2" ry="2" />
<text x="46.49" y="687.5" ></text>
</g>
<g >
<title>do_page_fault (30,574,677,911 samples, 3.71%)</title><rect x="85.4" y="277" width="43.8" height="15.0" fill="rgb(216,54,13)" rx="2" ry="2" />
<text x="88.39" y="287.5" >do_p..</text>
</g>
<g >
<title>tick_nohz_restart (519,686,314 samples, 0.06%)</title><rect x="1187.4" y="709" width="0.7" height="15.0" fill="rgb(246,191,45)" rx="2" ry="2" />
<text x="1190.41" y="719.5" ></text>
</g>
<g >
<title>__alloc_pages_nodemask (1,339,431,443 samples, 0.16%)</title><rect x="386.5" y="197" width="1.9" height="15.0" fill="rgb(228,108,25)" rx="2" ry="2" />
<text x="389.50" y="207.5" ></text>
</g>
<g >
<title>do_page_fault (1,241,456,530 samples, 0.15%)</title><rect x="327.3" y="437" width="1.8" height="15.0" fill="rgb(216,54,13)" rx="2" ry="2" />
<text x="330.31" y="447.5" ></text>
</g>
<g >
<title>get_hash_value (290,571,195 samples, 0.04%)</title><rect x="134.3" y="357" width="0.4" height="15.0" fill="rgb(211,27,6)" rx="2" ry="2" />
<text x="137.28" y="367.5" ></text>
</g>
<g >
<title>dequeue_entity (218,929,197 samples, 0.03%)</title><rect x="16.6" y="213" width="0.3" height="15.0" fill="rgb(233,130,31)" rx="2" ry="2" />
<text x="19.59" y="223.5" ></text>
</g>
<g >
<title>balance_dirty_pages_ratelimited (437,419,495 samples, 0.05%)</title><rect x="389.2" y="245" width="0.6" height="15.0" fill="rgb(212,34,8)" rx="2" ry="2" />
<text x="392.19" y="255.5" ></text>
</g>
<g >
<title>ItemPointerSet (994,427,062 samples, 0.12%)</title><rect x="161.6" y="501" width="1.5" height="15.0" fill="rgb(237,147,35)" rx="2" ry="2" />
<text x="164.64" y="511.5" ></text>
</g>
<g >
<title>call_rwsem_down_read_failed (897,090,581 samples, 0.11%)</title><rect x="130.0" y="293" width="1.3" height="15.0" fill="rgb(228,110,26)" rx="2" ry="2" />
<text x="133.04" y="303.5" ></text>
</g>
<g >
<title>retint_userspace_restore_args (765,542,700 samples, 0.09%)</title><rect x="276.4" y="421" width="1.1" height="15.0" fill="rgb(215,46,11)" rx="2" ry="2" />
<text x="279.42" y="431.5" ></text>
</g>
<g >
<title>perf (495,522,613 samples, 0.06%)</title><rect x="10.3" y="789" width="0.7" height="15.0" fill="rgb(242,171,40)" rx="2" ry="2" />
<text x="13.27" y="799.5" ></text>
</g>
<g >
<title>UnlockBufHdr (89,022,688 samples, 0.01%)</title><rect x="1118.8" y="453" width="0.1" height="15.0" fill="rgb(219,68,16)" rx="2" ry="2" />
<text x="1121.81" y="463.5" ></text>
</g>
<g >
<title>system_call_fastpath (109,695,662 samples, 0.01%)</title><rect x="604.3" y="453" width="0.2" height="15.0" fill="rgb(252,217,52)" rx="2" ry="2" />
<text x="607.35" y="463.5" ></text>
</g>
<g >
<title>__schedule (1,476,060,735 samples, 0.18%)</title><rect x="592.4" y="261" width="2.1" height="15.0" fill="rgb(227,103,24)" rx="2" ry="2" />
<text x="595.41" y="271.5" ></text>
</g>
<g >
<title>dequeue_task_fair (337,071,229 samples, 0.04%)</title><rect x="16.4" y="229" width="0.5" height="15.0" fill="rgb(230,119,28)" rx="2" ry="2" />
<text x="19.44" y="239.5" ></text>
</g>
<g >
<title>BufTableLookup (693,381,643 samples, 0.08%)</title><rect x="615.4" y="389" width="1.0" height="15.0" fill="rgb(224,89,21)" rx="2" ry="2" />
<text x="618.39" y="399.5" ></text>
</g>
<g >
<title>__mem_cgroup_count_vm_event (82,486,615 samples, 0.01%)</title><rect x="268.2" y="357" width="0.2" height="15.0" fill="rgb(217,56,13)" rx="2" ry="2" />
<text x="271.23" y="367.5" ></text>
</g>
<g >
<title>page_fault (191,062,306 samples, 0.02%)</title><rect x="49.7" y="709" width="0.3" height="15.0" fill="rgb(243,177,42)" rx="2" ry="2" />
<text x="52.70" y="719.5" ></text>
</g>
<g >
<title>__find_lock_page (237,787,344 samples, 0.03%)</title><rect x="64.0" y="277" width="0.3" height="15.0" fill="rgb(251,214,51)" rx="2" ry="2" />
<text x="66.98" y="287.5" ></text>
</g>
<g >
<title>put_prev_task_idle (82,518,274 samples, 0.01%)</title><rect x="1182.5" y="677" width="0.1" height="15.0" fill="rgb(234,136,32)" rx="2" ry="2" />
<text x="1185.45" y="687.5" ></text>
</g>
<g >
<title>table_relation_vacuum (22,616,657,143 samples, 2.74%)</title><rect x="11.0" y="725" width="32.4" height="15.0" fill="rgb(214,43,10)" rx="2" ry="2" />
<text x="13.98" y="735.5" >ta..</text>
</g>
<g >
<title>__hrtimer_run_queues (137,764,498 samples, 0.02%)</title><rect x="1033.9" y="405" width="0.2" height="15.0" fill="rgb(237,150,35)" rx="2" ry="2" />
<text x="1036.92" y="415.5" ></text>
</g>
<g >
<title>do_sync_read (813,584,497 samples, 0.10%)</title><rect x="21.2" y="485" width="1.2" height="15.0" fill="rgb(237,147,35)" rx="2" ry="2" />
<text x="24.20" y="495.5" ></text>
</g>
<g >
<title>select_task_rq_fair (138,619,233 samples, 0.02%)</title><rect x="18.9" y="245" width="0.2" height="15.0" fill="rgb(211,29,7)" rx="2" ry="2" />
<text x="21.86" y="255.5" ></text>
</g>
<g >
<title>GetPrivateRefCountEntry (91,262,100 samples, 0.01%)</title><rect x="1144.8" y="517" width="0.1" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="1147.80" y="527.5" ></text>
</g>
<g >
<title>smp_apic_timer_interrupt (137,764,122 samples, 0.02%)</title><rect x="990.7" y="469" width="0.2" height="15.0" fill="rgb(221,74,17)" rx="2" ry="2" />
<text x="993.73" y="479.5" ></text>
</g>
<g >
<title>pg_atomic_read_u32_impl (90,077,236 samples, 0.01%)</title><rect x="1143.5" y="485" width="0.2" height="15.0" fill="rgb(231,122,29)" rx="2" ry="2" />
<text x="1146.55" y="495.5" ></text>
</g>
<g >
<title>parallel_vacuum_main (637,238,905,240 samples, 77.31%)</title><rect x="233.7" y="629" width="912.3" height="15.0" fill="rgb(213,40,9)" rx="2" ry="2" />
<text x="236.67" y="639.5" >parallel_vacuum_main</text>
</g>
<g >
<title>__memcmp_sse4_1 (184,818,275 samples, 0.02%)</title><rect x="615.6" y="373" width="0.3" height="15.0" fill="rgb(240,162,38)" rx="2" ry="2" />
<text x="618.64" y="383.5" ></text>
</g>
<g >
<title>tick_sched_timer (70,310,500 samples, 0.01%)</title><rect x="96.9" y="85" width="0.1" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="99.89" y="95.5" ></text>
</g>
<g >
<title>TransactionIdFollows (75,328,725 samples, 0.01%)</title><rect x="37.1" y="581" width="0.1" height="15.0" fill="rgb(222,79,18)" rx="2" ry="2" />
<text x="40.10" y="591.5" ></text>
</g>
<g >
<title>BufferAlloc (135,121,430 samples, 0.02%)</title><rect x="22.8" y="517" width="0.2" height="15.0" fill="rgb(252,220,52)" rx="2" ry="2" />
<text x="25.80" y="527.5" ></text>
</g>
<g >
<title>smp_apic_timer_interrupt (102,529,432 samples, 0.01%)</title><rect x="1125.8" y="469" width="0.1" height="15.0" fill="rgb(221,74,17)" rx="2" ry="2" />
<text x="1128.79" y="479.5" ></text>
</g>
<g >
<title>postmaster_child_launch (3,096,750,624 samples, 0.38%)</title><rect x="50.2" y="693" width="4.4" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="53.19" y="703.5" ></text>
</g>
<g >
<title>hrtimer_interrupt (177,305,943 samples, 0.02%)</title><rect x="709.2" y="469" width="0.3" height="15.0" fill="rgb(228,109,26)" rx="2" ry="2" />
<text x="712.23" y="479.5" ></text>
</g>
<g >
<title>BufferIsValid (92,358,521 samples, 0.01%)</title><rect x="630.7" y="517" width="0.1" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="633.71" y="527.5" ></text>
</g>
<g >
<title>BufTableHashCode (179,304,943 samples, 0.02%)</title><rect x="56.1" y="437" width="0.2" height="15.0" fill="rgb(215,47,11)" rx="2" ry="2" />
<text x="59.09" y="447.5" ></text>
</g>
<g >
<title>update_process_times (263,322,681 samples, 0.03%)</title><rect x="445.1" y="69" width="0.4" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="448.15" y="79.5" ></text>
</g>
<g >
<title>BufferIsValid (171,181,149 samples, 0.02%)</title><rect x="225.7" y="453" width="0.2" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="228.67" y="463.5" ></text>
</g>
<g >
<title>do_generic_file_read.constprop.52 (35,083,865,799 samples, 4.26%)</title><rect x="79.2" y="309" width="50.2" height="15.0" fill="rgb(205,4,1)" rx="2" ry="2" />
<text x="82.16" y="319.5" >do_ge..</text>
</g>
<g >
<title>local_apic_timer_interrupt (385,779,583 samples, 0.05%)</title><rect x="445.1" y="149" width="0.5" height="15.0" fill="rgb(213,37,9)" rx="2" ry="2" />
<text x="448.05" y="159.5" ></text>
</g>
<g >
<title>PageGetItem (491,686,765 samples, 0.06%)</title><rect x="148.4" y="501" width="0.7" height="15.0" fill="rgb(214,43,10)" rx="2" ry="2" />
<text x="151.41" y="511.5" ></text>
</g>
<g >
<title>xfs_file_buffered_aio_read (36,944,327,015 samples, 4.48%)</title><rect x="78.9" y="341" width="52.9" height="15.0" fill="rgb(217,55,13)" rx="2" ry="2" />
<text x="81.93" y="351.5" >xfs_f..</text>
</g>
<g >
<title>UnlockBufHdr (102,139,725 samples, 0.01%)</title><rect x="330.5" y="485" width="0.2" height="15.0" fill="rgb(219,68,16)" rx="2" ry="2" />
<text x="333.55" y="495.5" ></text>
</g>
<g >
<title>get_futex_key (101,293,331 samples, 0.01%)</title><rect x="1142.1" y="389" width="0.2" height="15.0" fill="rgb(252,216,51)" rx="2" ry="2" />
<text x="1145.14" y="399.5" ></text>
</g>
<g >
<title>nohz_balance_exit_idle.part.59 (74,180,775 samples, 0.01%)</title><rect x="621.3" y="309" width="0.1" height="15.0" fill="rgb(227,105,25)" rx="2" ry="2" />
<text x="624.34" y="319.5" ></text>
</g>
<g >
<title>heap_page_prune_and_freeze (11,253,277,969 samples, 1.37%)</title><rect x="26.7" y="629" width="16.1" height="15.0" fill="rgb(213,40,9)" rx="2" ry="2" />
<text x="29.68" y="639.5" ></text>
</g>
<g >
<title>tick_sched_timer (149,652,535 samples, 0.02%)</title><rect x="828.8" y="405" width="0.3" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="831.84" y="415.5" ></text>
</g>
<g >
<title>apic_timer_interrupt (102,529,432 samples, 0.01%)</title><rect x="1125.8" y="485" width="0.1" height="15.0" fill="rgb(205,1,0)" rx="2" ry="2" />
<text x="1128.79" y="495.5" ></text>
</g>
<g >
<title>radix_tree_descend (1,477,690,372 samples, 0.18%)</title><rect x="460.5" y="149" width="2.1" height="15.0" fill="rgb(243,175,41)" rx="2" ry="2" />
<text x="463.45" y="159.5" ></text>
</g>
<g >
<title>update_vacuum_error_info (86,440,702 samples, 0.01%)</title><rect x="1145.7" y="565" width="0.1" height="15.0" fill="rgb(231,119,28)" rx="2" ry="2" />
<text x="1148.71" y="575.5" ></text>
</g>
<g >
<title>TransactionLogFetch (274,508,106 samples, 0.03%)</title><rect x="1122.7" y="469" width="0.4" height="15.0" fill="rgb(244,180,43)" rx="2" ry="2" />
<text x="1125.68" y="479.5" ></text>
</g>
<g >
<title>pick_next_task_fair (161,094,855 samples, 0.02%)</title><rect x="594.2" y="245" width="0.2" height="15.0" fill="rgb(242,170,40)" rx="2" ry="2" />
<text x="597.17" y="255.5" ></text>
</g>
<g >
<title>HeapTupleSatisfiesVacuumHorizon (1,074,397,054 samples, 0.13%)</title><rect x="760.4" y="517" width="1.5" height="15.0" fill="rgb(207,13,3)" rx="2" ry="2" />
<text x="763.38" y="527.5" ></text>
</g>
<g >
<title>BackgroundWorkerMain (124,816,047,342 samples, 15.14%)</title><rect x="54.6" y="645" width="178.7" height="15.0" fill="rgb(240,161,38)" rx="2" ry="2" />
<text x="57.63" y="655.5" >BackgroundWorkerMain</text>
</g>
<g >
<title>shmem_fault (993,215,265 samples, 0.12%)</title><rect x="255.6" y="325" width="1.4" height="15.0" fill="rgb(236,143,34)" rx="2" ry="2" />
<text x="258.56" y="335.5" ></text>
</g>
<g >
<title>HeapTupleSatisfiesVacuumHorizon (170,406,319 samples, 0.02%)</title><rect x="25.4" y="597" width="0.3" height="15.0" fill="rgb(207,13,3)" rx="2" ry="2" />
<text x="28.42" y="607.5" ></text>
</g>
<g >
<title>perf_event_task_tick (76,661,909 samples, 0.01%)</title><rect x="445.3" y="37" width="0.1" height="15.0" fill="rgb(205,3,0)" rx="2" ry="2" />
<text x="448.27" y="47.5" ></text>
</g>
<g >
<title>PageGetItem (3,316,241,201 samples, 0.40%)</title><rect x="692.7" y="517" width="4.8" height="15.0" fill="rgb(214,43,10)" rx="2" ry="2" />
<text x="695.73" y="527.5" ></text>
</g>
<g >
<title>do_futex (73,620,781 samples, 0.01%)</title><rect x="232.6" y="405" width="0.1" height="15.0" fill="rgb(245,184,44)" rx="2" ry="2" />
<text x="235.59" y="415.5" ></text>
</g>
<g >
<title>cpuidle_enter_state (6,320,770,283 samples, 0.77%)</title><rect x="1167.1" y="693" width="9.0" height="15.0" fill="rgb(221,73,17)" rx="2" ry="2" />
<text x="1170.08" y="703.5" ></text>
</g>
<g >
<title>do_page_fault (324,679,284 samples, 0.04%)</title><rect x="57.2" y="373" width="0.5" height="15.0" fill="rgb(216,54,13)" rx="2" ry="2" />
<text x="60.20" y="383.5" ></text>
</g>
<g >
<title>apic_timer_interrupt (372,390,171 samples, 0.05%)</title><rect x="784.3" y="517" width="0.5" height="15.0" fill="rgb(205,1,0)" rx="2" ry="2" />
<text x="787.29" y="527.5" ></text>
</g>
<g >
<title>ttwu_do_activate (151,810,473 samples, 0.02%)</title><rect x="19.1" y="245" width="0.2" height="15.0" fill="rgb(215,48,11)" rx="2" ry="2" />
<text x="22.06" y="255.5" ></text>
</g>
<g >
<title>ItemPointerSet (299,085,291 samples, 0.04%)</title><rect x="25.8" y="613" width="0.4" height="15.0" fill="rgb(237,147,35)" rx="2" ry="2" />
<text x="28.80" y="623.5" ></text>
</g>
<g >
<title>pgstat_count_io_op_n (81,315,588 samples, 0.01%)</title><rect x="76.6" y="469" width="0.1" height="15.0" fill="rgb(232,128,30)" rx="2" ry="2" />
<text x="79.57" y="479.5" ></text>
</g>
<g >
<title>smgrwrite (749,558,632 samples, 0.09%)</title><rect x="50.4" y="293" width="1.1" height="15.0" fill="rgb(229,112,26)" rx="2" ry="2" />
<text x="53.45" y="303.5" ></text>
</g>
<g >
<title>tick_sched_handle (95,227,023 samples, 0.01%)</title><rect x="393.6" y="101" width="0.2" height="15.0" fill="rgb(219,68,16)" rx="2" ry="2" />
<text x="396.64" y="111.5" ></text>
</g>
<g >
<title>HeapTupleSatisfiesVacuumHorizon (142,917,883 samples, 0.02%)</title><rect x="54.3" y="373" width="0.2" height="15.0" fill="rgb(207,13,3)" rx="2" ry="2" />
<text x="57.30" y="383.5" ></text>
</g>
<g >
[Attached: perf CPU flame graph (SVG markup omitted) of a parallel heap vacuum run. Most samples fall under ParallelWorkerMain (77.3% of all samples); the dominant frames are heap_page_prune_and_freeze (35.8%), ReadBuffer_common (30.9%), system_call_fastpath/vfs_read (~22%, page-cache reads), __do_page_fault (17.4%) and shmem_fault (15.8%) from shared-buffer page faults, heap_prune_chain (16.8%), and heap_parallel_vacuum_scan_worker (15.1%).]
<text x="57.63" y="687.5" >do_start_bgworker</text>
</g>
<g >
<title>cpuidle_get_cpu_driver (152,075,996 samples, 0.02%)</title><rect x="1176.1" y="693" width="0.2" height="15.0" fill="rgb(231,121,29)" rx="2" ry="2" />
<text x="1179.13" y="703.5" ></text>
</g>
<g >
<title>BufferIsValid (116,426,514 samples, 0.01%)</title><rect x="227.1" y="405" width="0.1" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="230.06" y="415.5" ></text>
</g>
<g >
<title>heap_prune_record_unchanged_lp_normal (16,015,892,933 samples, 1.94%)</title><rect x="195.3" y="485" width="23.0" height="15.0" fill="rgb(221,76,18)" rx="2" ry="2" />
<text x="198.33" y="495.5" >h..</text>
</g>
<g >
<title>PageGetItem (74,045,246 samples, 0.01%)</title><rect x="53.3" y="373" width="0.1" height="15.0" fill="rgb(214,43,10)" rx="2" ry="2" />
<text x="56.27" y="383.5" ></text>
</g>
<g >
<title>StartReadBuffer (987,780,807 samples, 0.12%)</title><rect x="50.2" y="389" width="1.4" height="15.0" fill="rgb(222,78,18)" rx="2" ry="2" />
<text x="53.22" y="399.5" ></text>
</g>
<g >
<title>WaitReadBuffers (194,162,665,756 samples, 23.56%)</title><rect x="323.2" y="517" width="277.9" height="15.0" fill="rgb(210,26,6)" rx="2" ry="2" />
<text x="326.17" y="527.5" >WaitReadBuffers</text>
</g>
<g >
<title>__list_del_entry (174,296,378 samples, 0.02%)</title><rect x="574.7" y="101" width="0.2" height="15.0" fill="rgb(214,41,9)" rx="2" ry="2" />
<text x="577.66" y="111.5" ></text>
</g>
<g >
<title>call_rwsem_wake (1,442,932,814 samples, 0.18%)</title><rect x="17.2" y="309" width="2.1" height="15.0" fill="rgb(231,119,28)" rx="2" ry="2" />
<text x="20.21" y="319.5" ></text>
</g>
<g >
<title>BufferAlloc (5,090,339,570 samples, 0.62%)</title><rect x="613.1" y="405" width="7.3" height="15.0" fill="rgb(252,220,52)" rx="2" ry="2" />
<text x="616.13" y="415.5" ></text>
</g>
<g >
<title>start_cpu (19,105,050,665 samples, 2.32%)</title><rect x="1162.6" y="773" width="27.4" height="15.0" fill="rgb(226,98,23)" rx="2" ry="2" />
<text x="1165.65" y="783.5" >s..</text>
</g>
<g >
<title>_raw_spin_lock_irqsave (97,476,627 samples, 0.01%)</title><rect x="14.2" y="213" width="0.1" height="15.0" fill="rgb(247,195,46)" rx="2" ry="2" />
<text x="17.17" y="223.5" ></text>
</g>
<g >
<title>WaitReadBuffersCanStartIO (75,733,651 samples, 0.01%)</title><rect x="20.6" y="597" width="0.1" height="15.0" fill="rgb(210,27,6)" rx="2" ry="2" />
<text x="23.64" y="607.5" ></text>
</g>
<g >
<title>BlockIdSet (428,700,237 samples, 0.05%)</title><rect x="147.8" y="485" width="0.6" height="15.0" fill="rgb(236,143,34)" rx="2" ry="2" />
<text x="150.80" y="495.5" ></text>
</g>
<g >
<title>BufferGetBlockNumber (74,426,847 samples, 0.01%)</title><rect x="143.4" y="501" width="0.1" height="15.0" fill="rgb(206,7,1)" rx="2" ry="2" />
<text x="146.41" y="511.5" ></text>
</g>
<g >
<title>auditsys (75,952,356 samples, 0.01%)</title><rect x="77.3" y="421" width="0.2" height="15.0" fill="rgb(240,161,38)" rx="2" ry="2" />
<text x="80.35" y="431.5" ></text>
</g>
<g >
<title>pg_atomic_fetch_or_u32 (630,445,434 samples, 0.08%)</title><rect x="321.8" y="437" width="0.9" height="15.0" fill="rgb(221,74,17)" rx="2" ry="2" />
<text x="324.85" y="447.5" ></text>
</g>
<g >
<title>InitBufferTag (219,492,168 samples, 0.03%)</title><rect x="616.4" y="389" width="0.4" height="15.0" fill="rgb(230,116,27)" rx="2" ry="2" />
<text x="619.45" y="399.5" ></text>
</g>
<g >
<title>vacuum_delay_point (91,371,963 samples, 0.01%)</title><rect x="1145.8" y="565" width="0.2" height="15.0" fill="rgb(208,17,4)" rx="2" ry="2" />
<text x="1148.83" y="575.5" ></text>
</g>
<g >
<title>postmaster_child_launch (124,939,849,462 samples, 15.16%)</title><rect x="54.6" y="661" width="178.9" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="57.63" y="671.5" >postmaster_child_launch</text>
</g>
<g >
<title>pick_next_entity (84,032,717 samples, 0.01%)</title><rect x="1182.3" y="677" width="0.2" height="15.0" fill="rgb(244,181,43)" rx="2" ry="2" />
<text x="1185.33" y="687.5" ></text>
</g>
<g >
<title>error_swapgs (212,144,618 samples, 0.03%)</title><rect x="47.0" y="725" width="0.3" height="15.0" fill="rgb(251,212,50)" rx="2" ry="2" />
<text x="49.96" y="735.5" ></text>
</g>
<g >
<title>radix_tree_lookup_slot (8,265,494,430 samples, 1.00%)</title><rect x="426.9" y="149" width="11.9" height="15.0" fill="rgb(210,23,5)" rx="2" ry="2" />
<text x="429.94" y="159.5" ></text>
</g>
<g >
<title>sysret_check (2,042,178,839 samples, 0.25%)</title><rect x="337.0" y="437" width="3.0" height="15.0" fill="rgb(249,205,49)" rx="2" ry="2" />
<text x="340.04" y="447.5" ></text>
</g>
<g >
<title>TransactionIdFollows (486,638,360 samples, 0.06%)</title><rect x="634.6" y="533" width="0.7" height="15.0" fill="rgb(222,79,18)" rx="2" ry="2" />
<text x="637.64" y="543.5" ></text>
</g>
<g >
<title>pg_atomic_compare_exchange_u32 (327,137,038 samples, 0.04%)</title><rect x="602.0" y="469" width="0.5" height="15.0" fill="rgb(253,220,52)" rx="2" ry="2" />
<text x="605.04" y="479.5" ></text>
</g>
<g >
<title>GetPrivateRefCount (3,363,540,090 samples, 0.41%)</title><rect x="1111.8" y="453" width="4.8" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="1114.82" y="463.5" ></text>
</g>
<g >
<title>PageGetItemId (2,601,470,039 samples, 0.32%)</title><rect x="977.3" y="485" width="3.7" height="15.0" fill="rgb(246,192,46)" rx="2" ry="2" />
<text x="980.26" y="495.5" ></text>
</g>
<g >
<title>StartReadBuffersImpl (6,507,529,578 samples, 0.79%)</title><rect x="11.2" y="597" width="9.3" height="15.0" fill="rgb(232,125,30)" rx="2" ry="2" />
<text x="14.16" y="607.5" ></text>
</g>
<g >
<title>LWLockHeldByMeInMode (122,673,726 samples, 0.01%)</title><rect x="631.5" y="517" width="0.2" height="15.0" fill="rgb(207,12,2)" rx="2" ry="2" />
<text x="634.51" y="527.5" ></text>
</g>
<g >
<title>__radix_tree_lookup (80,440,470 samples, 0.01%)</title><rect x="79.9" y="261" width="0.2" height="15.0" fill="rgb(253,222,53)" rx="2" ry="2" />
<text x="82.95" y="271.5" ></text>
</g>
<g >
<title>LWLockAcquire (1,565,521,410 samples, 0.19%)</title><rect x="601.8" y="501" width="2.2" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="604.80" y="511.5" ></text>
</g>
<g >
<title>UnpinBuffer (371,735,190 samples, 0.05%)</title><rect x="607.0" y="501" width="0.5" height="15.0" fill="rgb(252,219,52)" rx="2" ry="2" />
<text x="609.97" y="511.5" ></text>
</g>
<g >
<title>LockBuffer (366,250,569 samples, 0.04%)</title><rect x="232.2" y="501" width="0.6" height="15.0" fill="rgb(235,142,34)" rx="2" ry="2" />
<text x="235.24" y="511.5" ></text>
</g>
<g >
<title>hrtimer_wakeup (760,595,344 samples, 0.09%)</title><rect x="1165.6" y="613" width="1.1" height="15.0" fill="rgb(252,219,52)" rx="2" ry="2" />
<text x="1168.57" y="623.5" ></text>
</g>
<g >
<title>apic_timer_interrupt (75,893,920 samples, 0.01%)</title><rect x="173.3" y="453" width="0.2" height="15.0" fill="rgb(205,1,0)" rx="2" ry="2" />
<text x="176.34" y="463.5" ></text>
</g>
<g >
<title>TransactionIdFollows (180,256,970 samples, 0.02%)</title><rect x="191.0" y="469" width="0.2" height="15.0" fill="rgb(222,79,18)" rx="2" ry="2" />
<text x="193.98" y="479.5" ></text>
</g>
<g >
<title>native_write_msr_safe (147,847,573 samples, 0.02%)</title><rect x="1187.9" y="629" width="0.2" height="15.0" fill="rgb(243,176,42)" rx="2" ry="2" />
<text x="1190.86" y="639.5" ></text>
</g>
<g >
<title>parallel_vacuum_process_table (124,325,521,611 samples, 15.08%)</title><rect x="55.2" y="597" width="178.0" height="15.0" fill="rgb(205,3,0)" rx="2" ry="2" />
<text x="58.21" y="607.5" >parallel_vacuum_process..</text>
</g>
<g >
<title>__hrtimer_run_queues (80,197,691 samples, 0.01%)</title><rect x="853.8" y="405" width="0.1" height="15.0" fill="rgb(237,150,35)" rx="2" ry="2" />
<text x="856.82" y="415.5" ></text>
</g>
<g >
<title>_raw_qspin_lock (603,472,699 samples, 0.07%)</title><rect x="392.6" y="197" width="0.9" height="15.0" fill="rgb(210,23,5)" rx="2" ry="2" />
<text x="395.62" y="207.5" ></text>
</g>
<g >
<title>htsv_get_valid_status (208,668,088 samples, 0.03%)</title><rect x="230.8" y="501" width="0.3" height="15.0" fill="rgb(251,212,50)" rx="2" ry="2" />
<text x="233.80" y="511.5" ></text>
</g>
<g >
<title>TransactionIdPrecedes (450,921,617 samples, 0.05%)</title><rect x="635.3" y="533" width="0.7" height="15.0" fill="rgb(226,98,23)" rx="2" ry="2" />
<text x="638.33" y="543.5" ></text>
</g>
<g >
<title>heap_tuple_should_freeze (1,253,447,553 samples, 0.15%)</title><rect x="1059.0" y="485" width="1.8" height="15.0" fill="rgb(247,194,46)" rx="2" ry="2" />
<text x="1062.05" y="495.5" ></text>
</g>
<g >
<title>BufferGetBlock (2,078,122,577 samples, 0.25%)</title><rect x="1106.4" y="437" width="2.9" height="15.0" fill="rgb(242,172,41)" rx="2" ry="2" />
<text x="1109.37" y="447.5" ></text>
</g>
<g >
<title>page_fault (100,582,489 samples, 0.01%)</title><rect x="11.3" y="501" width="0.2" height="15.0" fill="rgb(243,177,42)" rx="2" ry="2" />
<text x="14.32" y="511.5" ></text>
</g>
<g >
<title>GetPrivateRefCount (342,353,158 samples, 0.04%)</title><rect x="235.8" y="533" width="0.5" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="238.85" y="543.5" ></text>
</g>
<g >
<title>ReadBufferExtended (1,180,547,054 samples, 0.14%)</title><rect x="50.2" y="421" width="1.7" height="15.0" fill="rgb(242,171,40)" rx="2" ry="2" />
<text x="53.22" y="431.5" ></text>
</g>
<g >
<title>__alloc_pages_nodemask (175,142,150 samples, 0.02%)</title><rect x="86.5" y="181" width="0.2" height="15.0" fill="rgb(228,108,25)" rx="2" ry="2" />
<text x="89.45" y="191.5" ></text>
</g>
<g >
<title>__hrtimer_run_queues (75,893,920 samples, 0.01%)</title><rect x="173.3" y="389" width="0.2" height="15.0" fill="rgb(237,150,35)" rx="2" ry="2" />
<text x="176.34" y="399.5" ></text>
</g>
<g >
<title>GetPrivateRefCount (117,806,007 samples, 0.01%)</title><rect x="770.0" y="501" width="0.2" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="772.98" y="511.5" ></text>
</g>
<g >
<title>heap_prepare_freeze_tuple (130,238,813 samples, 0.02%)</title><rect x="194.3" y="485" width="0.2" height="15.0" fill="rgb(227,101,24)" rx="2" ry="2" />
<text x="197.30" y="495.5" ></text>
</g>
<g >
<title>__tick_nohz_idle_enter (2,292,490,269 samples, 0.28%)</title><rect x="1183.3" y="709" width="3.3" height="15.0" fill="rgb(223,85,20)" rx="2" ry="2" />
<text x="1186.33" y="719.5" ></text>
</g>
<g >
<title>SetHintBits (928,911,221 samples, 0.11%)</title><rect x="41.2" y="581" width="1.3" height="15.0" fill="rgb(225,93,22)" rx="2" ry="2" />
<text x="44.16" y="591.5" ></text>
</g>
<g >
<title>StartReadBuffer (173,270,851 samples, 0.02%)</title><rect x="22.8" y="565" width="0.2" height="15.0" fill="rgb(222,78,18)" rx="2" ry="2" />
<text x="25.77" y="575.5" ></text>
</g>
<g >
<title>pgstat_tracks_io_object (129,551,904 samples, 0.02%)</title><rect x="332.5" y="453" width="0.2" height="15.0" fill="rgb(207,13,3)" rx="2" ry="2" />
<text x="335.54" y="463.5" ></text>
</g>
<g >
<title>hrtimer_interrupt (372,039,770 samples, 0.05%)</title><rect x="445.1" y="133" width="0.5" height="15.0" fill="rgb(228,109,26)" rx="2" ry="2" />
<text x="448.07" y="143.5" ></text>
</g>
<g >
<title>TransactionIdFollows (2,709,146,850 samples, 0.33%)</title><rect x="981.0" y="485" width="3.9" height="15.0" fill="rgb(222,79,18)" rx="2" ry="2" />
<text x="983.98" y="495.5" ></text>
</g>
<g >
<title>_raw_qspin_lock (423,365,647 samples, 0.05%)</title><rect x="388.5" y="245" width="0.6" height="15.0" fill="rgb(210,23,5)" rx="2" ry="2" />
<text x="391.54" y="255.5" ></text>
</g>
<g >
<title>smp_reschedule_interrupt (205,569,025 samples, 0.02%)</title><rect x="1189.6" y="629" width="0.3" height="15.0" fill="rgb(225,96,23)" rx="2" ry="2" />
<text x="1192.62" y="639.5" ></text>
</g>
<g >
<title>lru_add_drain (232,758,886 samples, 0.03%)</title><rect x="1159.0" y="597" width="0.3" height="15.0" fill="rgb(229,113,27)" rx="2" ry="2" />
<text x="1162.01" y="607.5" ></text>
</g>
<g >
<title>heap_page_prune_execute (277,559,257 samples, 0.03%)</title><rect x="52.7" y="389" width="0.4" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="55.69" y="399.5" ></text>
</g>
<g >
<title>LWLockAttemptLock (1,182,210,687 samples, 0.14%)</title><rect x="318.7" y="437" width="1.6" height="15.0" fill="rgb(235,138,33)" rx="2" ry="2" />
<text x="321.66" y="447.5" ></text>
</g>
<g >
<title>pg_atomic_read_u32 (97,460,343 samples, 0.01%)</title><rect x="135.1" y="341" width="0.1" height="15.0" fill="rgb(248,202,48)" rx="2" ry="2" />
<text x="138.11" y="351.5" ></text>
</g>
<g >
<title>LWLockRelease (126,604,575 samples, 0.02%)</title><rect x="605.1" y="517" width="0.1" height="15.0" fill="rgb(217,58,13)" rx="2" ry="2" />
<text x="608.06" y="527.5" ></text>
</g>
<g >
<title>TransactionIdGetCommitLSN (97,888,579 samples, 0.01%)</title><rect x="229.6" y="469" width="0.1" height="15.0" fill="rgb(238,152,36)" rx="2" ry="2" />
<text x="232.60" y="479.5" ></text>
</g>
<g >
<title>UnlockReleaseBuffer (1,116,857,671 samples, 0.14%)</title><rect x="604.6" y="549" width="1.6" height="15.0" fill="rgb(215,47,11)" rx="2" ry="2" />
<text x="607.63" y="559.5" ></text>
</g>
<g >
<title>system_call_fastpath (105,899,380 samples, 0.01%)</title><rect x="320.8" y="405" width="0.2" height="15.0" fill="rgb(252,217,52)" rx="2" ry="2" />
<text x="323.82" y="415.5" ></text>
</g>
<g >
<title>xfs_vn_update_time (608,268,816 samples, 0.07%)</title><rect x="14.9" y="293" width="0.9" height="15.0" fill="rgb(234,136,32)" rx="2" ry="2" />
<text x="17.90" y="303.5" ></text>
</g>
<g >
<title>__radix_tree_lookup (771,909,975 samples, 0.09%)</title><rect x="348.3" y="277" width="1.1" height="15.0" fill="rgb(253,222,53)" rx="2" ry="2" />
<text x="351.25" y="287.5" ></text>
</g>
<g >
<title>__do_page_fault (1,125,840,029 samples, 0.14%)</title><rect x="59.2" y="373" width="1.6" height="15.0" fill="rgb(239,158,37)" rx="2" ry="2" />
<text x="62.24" y="383.5" ></text>
</g>
<g >
<title>table_block_parallelscan_nextpage (164,549,416 samples, 0.02%)</title><rect x="133.4" y="501" width="0.3" height="15.0" fill="rgb(251,212,50)" rx="2" ry="2" />
<text x="136.43" y="511.5" ></text>
</g>
<g >
<title>GetPrivateRefCountEntry (738,358,079 samples, 0.09%)</title><rect x="666.5" y="485" width="1.0" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="669.47" y="495.5" ></text>
</g>
<g >
<title>BufferIsValid (130,432,966 samples, 0.02%)</title><rect x="631.2" y="485" width="0.2" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="634.18" y="495.5" ></text>
</g>
<g >
<title>UnpinBufferNoOwner (271,007,839 samples, 0.03%)</title><rect x="605.7" y="501" width="0.4" height="15.0" fill="rgb(253,221,53)" rx="2" ry="2" />
<text x="608.72" y="511.5" ></text>
</g>
<g >
<title>ResourceOwnerForgetBufferIO (328,641,149 samples, 0.04%)</title><rect x="330.1" y="485" width="0.4" height="15.0" fill="rgb(215,46,11)" rx="2" ry="2" />
<text x="333.08" y="495.5" ></text>
</g>
<g >
<title>xfs_file_buffered_aio_read (177,545,279,973 samples, 21.54%)</title><rect x="342.9" y="357" width="254.2" height="15.0" fill="rgb(217,55,13)" rx="2" ry="2" />
<text x="345.87" y="367.5" >xfs_file_buffered_aio_read</text>
</g>
<g >
<title>ktime_get (103,708,651 samples, 0.01%)</title><rect x="1176.4" y="693" width="0.2" height="15.0" fill="rgb(207,10,2)" rx="2" ry="2" />
<text x="1179.41" y="703.5" ></text>
</g>
<g >
<title>local_apic_timer_interrupt (75,893,920 samples, 0.01%)</title><rect x="173.3" y="421" width="0.2" height="15.0" fill="rgb(213,37,9)" rx="2" ry="2" />
<text x="176.34" y="431.5" ></text>
</g>
<g >
<title>PageGetItemId (6,143,384,329 samples, 0.75%)</title><rect x="921.9" y="501" width="8.8" height="15.0" fill="rgb(246,192,46)" rx="2" ry="2" />
<text x="924.95" y="511.5" ></text>
</g>
<g >
<title>do_sync_read (125,082,965 samples, 0.02%)</title><rect x="51.7" y="261" width="0.2" height="15.0" fill="rgb(237,147,35)" rx="2" ry="2" />
<text x="54.71" y="271.5" ></text>
</g>
<g >
<title>pg_atomic_fetch_or_u32_impl (1,664,615,677 samples, 0.20%)</title><rect x="264.2" y="389" width="2.4" height="15.0" fill="rgb(253,224,53)" rx="2" ry="2" />
<text x="267.20" y="399.5" ></text>
</g>
<g >
<title>touch_softlockup_watchdog_sched (74,319,380 samples, 0.01%)</title><rect x="1188.1" y="709" width="0.2" height="15.0" fill="rgb(208,17,4)" rx="2" ry="2" />
<text x="1191.15" y="719.5" ></text>
</g>
<g >
<title>error_sti (97,105,749 samples, 0.01%)</title><rect x="58.9" y="405" width="0.2" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="61.94" y="415.5" ></text>
</g>
<g >
<title>down_read_trylock (71,760,383 samples, 0.01%)</title><rect x="382.0" y="261" width="0.1" height="15.0" fill="rgb(219,66,15)" rx="2" ry="2" />
<text x="384.97" y="271.5" ></text>
</g>
<g >
<title>__hrtimer_run_queues (331,602,417 samples, 0.04%)</title><rect x="784.3" y="453" width="0.5" height="15.0" fill="rgb(237,150,35)" rx="2" ry="2" />
<text x="787.34" y="463.5" ></text>
</g>
<g >
<title>HeapTupleSatisfiesVacuumHorizon (279,402,846 samples, 0.03%)</title><rect x="161.2" y="501" width="0.4" height="15.0" fill="rgb(207,13,3)" rx="2" ry="2" />
<text x="164.24" y="511.5" ></text>
</g>
<g >
<title>__do_softirq (534,263,057 samples, 0.06%)</title><rect x="1164.6" y="613" width="0.7" height="15.0" fill="rgb(246,191,45)" rx="2" ry="2" />
<text x="1167.57" y="623.5" ></text>
</g>
<g >
<title>FileWriteV (4,930,570,578 samples, 0.60%)</title><rect x="12.3" y="469" width="7.0" height="15.0" fill="rgb(248,201,48)" rx="2" ry="2" />
<text x="15.28" y="479.5" ></text>
</g>
<g >
<title>set_page_dirty (1,451,339,273 samples, 0.18%)</title><rect x="582.4" y="229" width="2.0" height="15.0" fill="rgb(231,123,29)" rx="2" ry="2" />
<text x="585.36" y="239.5" ></text>
</g>
<g >
<title>do_futex_wait.constprop.1 (842,534,330 samples, 0.10%)</title><rect x="48.2" y="757" width="1.2" height="15.0" fill="rgb(237,150,36)" rx="2" ry="2" />
<text x="51.24" y="767.5" ></text>
</g>
<g >
<title>xfs_file_aio_read (37,063,306,932 samples, 4.50%)</title><rect x="78.8" y="357" width="53.0" height="15.0" fill="rgb(224,90,21)" rx="2" ry="2" />
<text x="81.78" y="367.5" >xfs_f..</text>
</g>
<g >
<title>generic_file_aio_read (735,764,864 samples, 0.09%)</title><rect x="21.3" y="437" width="1.0" height="15.0" fill="rgb(216,53,12)" rx="2" ry="2" />
<text x="24.25" y="447.5" ></text>
</g>
<g >
<title>futex_wake (72,503,396 samples, 0.01%)</title><rect x="232.6" y="389" width="0.1" height="15.0" fill="rgb(219,65,15)" rx="2" ry="2" />
<text x="235.59" y="399.5" ></text>
</g>
<g >
<title>TransactionIdFollows (1,004,135,401 samples, 0.12%)</title><rect x="914.7" y="485" width="1.4" height="15.0" fill="rgb(222,79,18)" rx="2" ry="2" />
<text x="917.71" y="495.5" ></text>
</g>
<g >
<title>__dec_zone_page_state (184,851,726 samples, 0.02%)</title><rect x="1155.2" y="613" width="0.2" height="15.0" fill="rgb(250,208,49)" rx="2" ry="2" />
<text x="1158.17" y="623.5" ></text>
</g>
<g >
<title>do_page_fault (3,492,358,859 samples, 0.42%)</title><rect x="266.9" y="405" width="5.0" height="15.0" fill="rgb(216,54,13)" rx="2" ry="2" />
<text x="269.87" y="415.5" ></text>
</g>
<g >
<title>visibilitymap_set (252,733,328 samples, 0.03%)</title><rect x="42.9" y="629" width="0.4" height="15.0" fill="rgb(220,73,17)" rx="2" ry="2" />
<text x="45.89" y="639.5" ></text>
</g>
<g >
<title>BufferIsValid (143,364,206 samples, 0.02%)</title><rect x="42.0" y="533" width="0.2" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="44.98" y="543.5" ></text>
</g>
<g >
<title>ktime_get (151,921,201 samples, 0.02%)</title><rect x="1175.8" y="677" width="0.3" height="15.0" fill="rgb(207,10,2)" rx="2" ry="2" />
<text x="1178.84" y="687.5" ></text>
</g>
<g >
<title>__rmqueue (1,131,138,278 samples, 0.14%)</title><rect x="572.7" y="117" width="1.6" height="15.0" fill="rgb(249,203,48)" rx="2" ry="2" />
<text x="575.71" y="127.5" ></text>
</g>
<g >
<title>retint_userspace_restore_args (400,029,322 samples, 0.05%)</title><rect x="60.9" y="405" width="0.6" height="15.0" fill="rgb(215,46,11)" rx="2" ry="2" />
<text x="63.89" y="415.5" ></text>
</g>
<g >
<title>perf_pmu_sched_task (95,040,665 samples, 0.01%)</title><rect x="1181.9" y="661" width="0.2" height="15.0" fill="rgb(205,0,0)" rx="2" ry="2" />
<text x="1184.93" y="671.5" ></text>
</g>
<g >
<title>pg_atomic_compare_exchange_u32 (152,333,997 samples, 0.02%)</title><rect x="617.0" y="357" width="0.2" height="15.0" fill="rgb(253,220,52)" rx="2" ry="2" />
<text x="619.98" y="367.5" ></text>
</g>
<g >
<title>do_start_bgworker (637,265,816,681 samples, 77.32%)</title><rect x="233.6" y="693" width="912.4" height="15.0" fill="rgb(217,58,14)" rx="2" ry="2" />
<text x="236.63" y="703.5" >do_start_bgworker</text>
</g>
<g >
<title>tag_hash (365,341,020 samples, 0.04%)</title><rect x="238.0" y="421" width="0.6" height="15.0" fill="rgb(245,185,44)" rx="2" ry="2" />
<text x="241.05" y="431.5" ></text>
</g>
<g >
<title>heap_page_prune_execute (2,120,731,342 samples, 0.26%)</title><rect x="29.7" y="613" width="3.1" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="32.74" y="623.5" ></text>
</g>
<g >
<title>try_to_wake_up (149,194,408 samples, 0.02%)</title><rect x="1142.3" y="373" width="0.2" height="15.0" fill="rgb(220,70,16)" rx="2" ry="2" />
<text x="1145.29" y="383.5" ></text>
</g>
<g >
<title>do_page_fault (306,316,030 samples, 0.04%)</title><rect x="75.4" y="421" width="0.4" height="15.0" fill="rgb(216,54,13)" rx="2" ry="2" />
<text x="78.40" y="431.5" ></text>
</g>
<g >
<title>__hrtimer_run_queues (974,293,906 samples, 0.12%)</title><rect x="1165.4" y="629" width="1.4" height="15.0" fill="rgb(237,150,35)" rx="2" ry="2" />
<text x="1168.45" y="639.5" ></text>
</g>
<g >
<title>writeback_sb_inodes (115,471,936 samples, 0.01%)</title><rect x="10.1" y="661" width="0.2" height="15.0" fill="rgb(237,148,35)" rx="2" ry="2" />
<text x="13.10" y="671.5" ></text>
</g>
<g >
<title>__nanosleep_nocancel (161,702,524 samples, 0.02%)</title><rect x="71.7" y="373" width="0.3" height="15.0" fill="rgb(244,182,43)" rx="2" ry="2" />
<text x="74.74" y="383.5" ></text>
</g>
<g >
<title>__pte_alloc (280,034,695 samples, 0.03%)</title><rect x="86.3" y="229" width="0.4" height="15.0" fill="rgb(218,62,15)" rx="2" ry="2" />
<text x="89.31" y="239.5" ></text>
</g>
<g >
<title>__list_del_entry (257,289,144 samples, 0.03%)</title><rect x="126.5" y="69" width="0.4" height="15.0" fill="rgb(214,41,9)" rx="2" ry="2" />
<text x="129.50" y="79.5" ></text>
</g>
<g >
<title>copy_user_enhanced_fast_string (2,495,620,242 samples, 0.30%)</title><rect x="80.2" y="293" width="3.6" height="15.0" fill="rgb(238,155,37)" rx="2" ry="2" />
<text x="83.18" y="303.5" ></text>
</g>
<g >
<title>ResourceOwnerRememberBuffer (186,384,007 samples, 0.02%)</title><rect x="620.0" y="373" width="0.2" height="15.0" fill="rgb(205,0,0)" rx="2" ry="2" />
<text x="622.96" y="383.5" ></text>
</g>
<g >
<title>PageGetItemId (206,443,479 samples, 0.03%)</title><rect x="137.4" y="517" width="0.3" height="15.0" fill="rgb(246,192,46)" rx="2" ry="2" />
<text x="140.40" y="527.5" ></text>
</g>
<g >
<title>BlockIdSet (2,751,605,618 samples, 0.33%)</title><rect x="688.8" y="501" width="3.9" height="15.0" fill="rgb(236,143,34)" rx="2" ry="2" />
<text x="691.78" y="511.5" ></text>
</g>
<g >
<title>iomap_write_begin (201,277,307 samples, 0.02%)</title><rect x="13.7" y="293" width="0.3" height="15.0" fill="rgb(211,30,7)" rx="2" ry="2" />
<text x="16.71" y="303.5" ></text>
</g>
<g >
<title>GetPrivateRefCount (2,576,624,425 samples, 0.31%)</title><rect x="1093.7" y="453" width="3.7" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="1096.71" y="463.5" ></text>
</g>
<g >
<title>cpu_startup_entry (376,644,420 samples, 0.05%)</title><rect x="1189.5" y="693" width="0.5" height="15.0" fill="rgb(252,220,52)" rx="2" ry="2" />
<text x="1192.46" y="703.5" ></text>
</g>
<g >
<title>hash_initial_lookup (104,078,404 samples, 0.01%)</title><rect x="248.7" y="421" width="0.1" height="15.0" fill="rgb(251,214,51)" rx="2" ry="2" />
<text x="251.69" y="431.5" ></text>
</g>
<g >
<title>rwsem_wake (267,114,761 samples, 0.03%)</title><rect x="51.1" y="69" width="0.4" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="54.10" y="79.5" ></text>
</g>
<g >
<title>mdreadv (38,842,203,263 samples, 4.71%)</title><rect x="76.7" y="469" width="55.6" height="15.0" fill="rgb(239,159,38)" rx="2" ry="2" />
<text x="79.73" y="479.5" >mdreadv</text>
</g>
<g >
<title>GetPrivateRefCountEntry (233,531,901 samples, 0.03%)</title><rect x="228.2" y="421" width="0.3" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="231.17" y="431.5" ></text>
</g>
<g >
<title>radix_tree_descend (404,985,780 samples, 0.05%)</title><rect x="94.3" y="101" width="0.6" height="15.0" fill="rgb(243,175,41)" rx="2" ry="2" />
<text x="97.31" y="111.5" ></text>
</g>
<g >
<title>do_lazy_scan_heap (124,261,154,376 samples, 15.08%)</title><rect x="55.3" y="549" width="177.9" height="15.0" fill="rgb(221,75,18)" rx="2" ry="2" />
<text x="58.27" y="559.5" >do_lazy_scan_heap</text>
</g>
<g >
<title>__nanosleep_nocancel (1,192,127,972 samples, 0.14%)</title><rect x="309.8" y="389" width="1.7" height="15.0" fill="rgb(244,182,43)" rx="2" ry="2" />
<text x="312.76" y="399.5" ></text>
</g>
<g >
<title>get_page_from_freelist (303,916,593 samples, 0.04%)</title><rect x="387.9" y="181" width="0.5" height="15.0" fill="rgb(252,218,52)" rx="2" ry="2" />
<text x="390.93" y="191.5" ></text>
</g>
<g >
<title>unlock_page (653,952,329 samples, 0.08%)</title><rect x="584.4" y="229" width="1.0" height="15.0" fill="rgb(220,69,16)" rx="2" ry="2" />
<text x="587.44" y="239.5" ></text>
</g>
<g >
<title>deactivate_task (180,545,601 samples, 0.02%)</title><rect x="310.9" y="277" width="0.3" height="15.0" fill="rgb(206,8,2)" rx="2" ry="2" />
<text x="313.89" y="287.5" ></text>
</g>
<g >
<title>release_pages (982,397,947 samples, 0.12%)</title><rect x="1159.3" y="597" width="1.4" height="15.0" fill="rgb(228,106,25)" rx="2" ry="2" />
<text x="1162.34" y="607.5" ></text>
</g>
<g >
<title>spin_delay (1,097,293,537 samples, 0.13%)</title><rect x="311.6" y="389" width="1.6" height="15.0" fill="rgb(240,162,38)" rx="2" ry="2" />
<text x="314.61" y="399.5" ></text>
</g>
<g >
<title>PageGetItem (4,897,701,292 samples, 0.59%)</title><rect x="770.9" y="517" width="7.0" height="15.0" fill="rgb(214,43,10)" rx="2" ry="2" />
<text x="773.87" y="527.5" ></text>
</g>
<g >
<title>BufferIsPermanent (6,832,761,335 samples, 0.83%)</title><rect x="1089.9" y="469" width="9.8" height="15.0" fill="rgb(250,210,50)" rx="2" ry="2" />
<text x="1092.94" y="479.5" ></text>
</g>
<g >
<title>tick_sched_timer (105,887,461 samples, 0.01%)</title><rect x="203.0" y="373" width="0.2" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="206.01" y="383.5" ></text>
</g>
<g >
<title>local_apic_timer_interrupt (1,175,890,158 samples, 0.14%)</title><rect x="1165.4" y="661" width="1.7" height="15.0" fill="rgb(213,37,9)" rx="2" ry="2" />
<text x="1168.38" y="671.5" ></text>
</g>
<g >
<title>radix_tree_descend (2,204,797,539 samples, 0.27%)</title><rect x="435.6" y="133" width="3.2" height="15.0" fill="rgb(243,175,41)" rx="2" ry="2" />
<text x="438.62" y="143.5" ></text>
</g>
<g >
<title>pgstat_count_io_op_n (295,160,674 samples, 0.04%)</title><rect x="620.7" y="389" width="0.4" height="15.0" fill="rgb(232,128,30)" rx="2" ry="2" />
<text x="623.71" y="399.5" ></text>
</g>
<g >
<title>PageGetItem (152,358,508 samples, 0.02%)</title><rect x="137.2" y="517" width="0.2" height="15.0" fill="rgb(214,43,10)" rx="2" ry="2" />
<text x="140.19" y="527.5" ></text>
</g>
<g >
<title>hash_initial_lookup (94,115,665 samples, 0.01%)</title><rect x="616.2" y="357" width="0.2" height="15.0" fill="rgb(251,214,51)" rx="2" ry="2" />
<text x="619.24" y="367.5" ></text>
</g>
<g >
<title>PageGetItemId (3,912,143,579 samples, 0.47%)</title><rect x="819.2" y="485" width="5.6" height="15.0" fill="rgb(246,192,46)" rx="2" ry="2" />
<text x="822.16" y="495.5" ></text>
</g>
<g >
<title>PageGetItemId (724,909,639 samples, 0.09%)</title><rect x="202.1" y="469" width="1.1" height="15.0" fill="rgb(246,192,46)" rx="2" ry="2" />
<text x="205.13" y="479.5" ></text>
</g>
<g >
<title>spin_delay (520,729,305 samples, 0.06%)</title><rect x="72.2" y="389" width="0.8" height="15.0" fill="rgb(240,162,38)" rx="2" ry="2" />
<text x="75.22" y="399.5" ></text>
</g>
<g >
<title>StartReadBuffersImpl (13,161,404,722 samples, 1.60%)</title><rect x="55.8" y="485" width="18.9" height="15.0" fill="rgb(232,125,30)" rx="2" ry="2" />
<text x="58.84" y="495.5" ></text>
</g>
<g >
<title>handle_mm_fault (305,172,058 samples, 0.04%)</title><rect x="57.2" y="341" width="0.5" height="15.0" fill="rgb(234,135,32)" rx="2" ry="2" />
<text x="60.23" y="351.5" ></text>
</g>
<g >
<title>update_vacuum_error_info (91,667,286 samples, 0.01%)</title><rect x="1144.0" y="549" width="0.1" height="15.0" fill="rgb(231,119,28)" rx="2" ry="2" />
<text x="1146.97" y="559.5" ></text>
</g>
<g >
<title>page_verify_redirects (1,503,325,323 samples, 0.18%)</title><rect x="179.2" y="485" width="2.1" height="15.0" fill="rgb(214,43,10)" rx="2" ry="2" />
<text x="182.16" y="495.5" ></text>
</g>
<g >
<title>touch_atime (109,998,855 samples, 0.01%)</title><rect x="129.2" y="293" width="0.2" height="15.0" fill="rgb(205,2,0)" rx="2" ry="2" />
<text x="132.24" y="303.5" ></text>
</g>
<g >
<title>PageGetItem (726,318,877 samples, 0.09%)</title><rect x="191.2" y="485" width="1.1" height="15.0" fill="rgb(214,43,10)" rx="2" ry="2" />
<text x="194.24" y="495.5" ></text>
</g>
<g >
<title>current_fs_time (107,024,710 samples, 0.01%)</title><rect x="582.0" y="213" width="0.2" height="15.0" fill="rgb(219,67,16)" rx="2" ry="2" />
<text x="585.00" y="223.5" ></text>
</g>
<g >
<title>startup_hacks (637,279,068,599 samples, 77.32%)</title><rect x="233.6" y="773" width="912.4" height="15.0" fill="rgb(243,178,42)" rx="2" ry="2" />
<text x="236.61" y="783.5" >startup_hacks</text>
</g>
<g >
<title>heap_vac_scan_next_block (349,956,658 samples, 0.04%)</title><rect x="23.3" y="645" width="0.5" height="15.0" fill="rgb(220,70,16)" rx="2" ry="2" />
<text x="26.25" y="655.5" ></text>
</g>
<g >
<title>xfs_iunlock (1,323,827,092 samples, 0.16%)</title><rect x="595.2" y="341" width="1.9" height="15.0" fill="rgb(232,127,30)" rx="2" ry="2" />
<text x="598.16" y="351.5" ></text>
</g>
<g >
<title>GetPrivateRefCount (71,803,743 samples, 0.01%)</title><rect x="226.0" y="453" width="0.1" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="229.01" y="463.5" ></text>
</g>
<g >
<title>shmem_alloc_page (1,182,607,381 samples, 0.14%)</title><rect x="125.6" y="165" width="1.6" height="15.0" fill="rgb(214,42,10)" rx="2" ry="2" />
<text x="128.56" y="175.5" ></text>
</g>
<g >
<title>GetPrivateRefCountEntry (445,167,955 samples, 0.05%)</title><rect x="1116.6" y="453" width="0.7" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="1119.64" y="463.5" ></text>
</g>
<g >
<title>LockBufHdr (81,409,022 samples, 0.01%)</title><rect x="228.8" y="437" width="0.1" height="15.0" fill="rgb(236,143,34)" rx="2" ry="2" />
<text x="231.81" y="447.5" ></text>
</g>
<g >
<title>page_fault (31,565,685,400 samples, 3.83%)</title><rect x="84.0" y="293" width="45.2" height="15.0" fill="rgb(243,177,42)" rx="2" ry="2" />
<text x="86.97" y="303.5" >page..</text>
</g>
<g >
<title>pg_atomic_sub_fetch_u32 (184,691,260 samples, 0.02%)</title><rect x="1142.5" y="485" width="0.3" height="15.0" fill="rgb(242,174,41)" rx="2" ry="2" />
<text x="1145.53" y="495.5" ></text>
</g>
<g >
<title>scheduler_tick (105,887,461 samples, 0.01%)</title><rect x="203.0" y="325" width="0.2" height="15.0" fill="rgb(246,190,45)" rx="2" ry="2" />
<text x="206.01" y="335.5" ></text>
</g>
<g >
<title>nr_iowait_cpu (143,367,153 samples, 0.02%)</title><rect x="1164.2" y="629" width="0.2" height="15.0" fill="rgb(252,216,51)" rx="2" ry="2" />
<text x="1167.17" y="639.5" ></text>
</g>
<g >
<title>heap_prune_record_unchanged_lp_normal (85,121,277,223 samples, 10.33%)</title><rect x="939.0" y="501" width="121.8" height="15.0" fill="rgb(221,76,18)" rx="2" ry="2" />
<text x="941.98" y="511.5" >heap_prune_reco..</text>
</g>
<g >
<title>hash_bytes (302,876,303 samples, 0.04%)</title><rect x="238.1" y="405" width="0.4" height="15.0" fill="rgb(227,102,24)" rx="2" ry="2" />
<text x="241.06" y="415.5" ></text>
</g>
<g >
<title>PageGetItem (194,036,472 samples, 0.02%)</title><rect x="34.6" y="597" width="0.3" height="15.0" fill="rgb(214,43,10)" rx="2" ry="2" />
<text x="37.61" y="607.5" ></text>
</g>
<g >
<title>mem_cgroup_update_page_stat (85,938,277 samples, 0.01%)</title><rect x="579.2" y="197" width="0.1" height="15.0" fill="rgb(220,71,17)" rx="2" ry="2" />
<text x="582.22" y="207.5" ></text>
</g>
<g >
<title>up_write (267,816,803 samples, 0.03%)</title><rect x="51.1" y="101" width="0.4" height="15.0" fill="rgb(235,139,33)" rx="2" ry="2" />
<text x="54.10" y="111.5" ></text>
</g>
<g >
<title>__find_lock_page (1,317,373,682 samples, 0.16%)</title><rect x="244.5" y="277" width="1.9" height="15.0" fill="rgb(251,214,51)" rx="2" ry="2" />
<text x="247.53" y="287.5" ></text>
</g>
<g >
<title>GetPrivateRefCount (88,086,298 samples, 0.01%)</title><rect x="41.5" y="549" width="0.1" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="44.47" y="559.5" ></text>
</g>
<g >
<title>ItemPointerSet (1,206,980,740 samples, 0.15%)</title><rect x="146.7" y="501" width="1.7" height="15.0" fill="rgb(237,147,35)" rx="2" ry="2" />
<text x="149.69" y="511.5" ></text>
</g>
<g >
<title>TransactionIdFollows (834,428,765 samples, 0.10%)</title><rect x="150.0" y="501" width="1.1" height="15.0" fill="rgb(222,79,18)" rx="2" ry="2" />
<text x="152.95" y="511.5" ></text>
</g>
<g >
<title>lapic_next_deadline (167,573,385 samples, 0.02%)</title><rect x="1187.8" y="645" width="0.3" height="15.0" fill="rgb(222,82,19)" rx="2" ry="2" />
<text x="1190.83" y="655.5" ></text>
</g>
<g >
<title>iomap_write_actor (257,110,195 samples, 0.03%)</title><rect x="10.4" y="517" width="0.4" height="15.0" fill="rgb(232,125,30)" rx="2" ry="2" />
<text x="13.44" y="527.5" ></text>
</g>
<g >
<title>do_set_pte (99,454,556 samples, 0.01%)</title><rect x="271.4" y="341" width="0.2" height="15.0" fill="rgb(253,221,52)" rx="2" ry="2" />
<text x="274.44" y="351.5" ></text>
</g>
<g >
<title>schedule (104,722,866 samples, 0.01%)</title><rect x="48.6" y="661" width="0.1" height="15.0" fill="rgb(254,229,54)" rx="2" ry="2" />
<text x="51.57" y="671.5" ></text>
</g>
<g >
<title>__find_get_page (232,454,527 samples, 0.03%)</title><rect x="64.0" y="261" width="0.3" height="15.0" fill="rgb(229,114,27)" rx="2" ry="2" />
<text x="66.99" y="271.5" ></text>
</g>
<g >
<title>PageRepairFragmentation (209,052,116 samples, 0.03%)</title><rect x="52.7" y="373" width="0.3" height="15.0" fill="rgb(226,98,23)" rx="2" ry="2" />
<text x="55.73" y="383.5" ></text>
</g>
<g >
<title>sys_pread64 (889,747,881 samples, 0.11%)</title><rect x="21.2" y="517" width="1.2" height="15.0" fill="rgb(212,35,8)" rx="2" ry="2" />
<text x="24.15" y="527.5" ></text>
</g>
<g >
<title>GetPrivateRefCount (115,895,397 samples, 0.01%)</title><rect x="1139.5" y="501" width="0.2" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="1142.53" y="511.5" ></text>
</g>
<g >
<title>update_process_times (110,255,351 samples, 0.01%)</title><rect x="1086.9" y="357" width="0.2" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="1089.95" y="367.5" ></text>
</g>
<g >
<title>__find_get_page (720,144,232 samples, 0.09%)</title><rect x="256.0" y="277" width="1.0" height="15.0" fill="rgb(229,114,27)" rx="2" ry="2" />
<text x="258.95" y="287.5" ></text>
</g>
<g >
<title>HeapTupleSatisfiesVacuumHorizon (38,186,273,238 samples, 4.63%)</title><rect x="1071.3" y="501" width="54.6" height="15.0" fill="rgb(207,13,3)" rx="2" ry="2" />
<text x="1074.28" y="511.5" >HeapT..</text>
</g>
<g >
<title>get_hash_entry (92,960,302 samples, 0.01%)</title><rect x="73.8" y="405" width="0.1" height="15.0" fill="rgb(225,93,22)" rx="2" ry="2" />
<text x="76.80" y="415.5" ></text>
</g>
<g >
<title>lazy_scan_prune (1,746,448,023 samples, 0.21%)</title><rect x="52.1" y="421" width="2.5" height="15.0" fill="rgb(243,178,42)" rx="2" ry="2" />
<text x="55.10" y="431.5" ></text>
</g>
<g >
<title>PortalRunMulti (3,088,212,841 samples, 0.37%)</title><rect x="50.2" y="613" width="4.4" height="15.0" fill="rgb(245,184,44)" rx="2" ry="2" />
<text x="53.20" y="623.5" ></text>
</g>
<g >
<title>timerqueue_del (139,858,843 samples, 0.02%)</title><rect x="1185.7" y="629" width="0.2" height="15.0" fill="rgb(236,145,34)" rx="2" ry="2" />
<text x="1188.70" y="639.5" ></text>
</g>
<g >
<title>radix_tree_lookup_slot (761,273,605 samples, 0.09%)</title><rect x="245.3" y="245" width="1.1" height="15.0" fill="rgb(210,23,5)" rx="2" ry="2" />
<text x="248.32" y="255.5" ></text>
</g>
<g >
<title>do_parallel_lazy_scan_heap (3,081,731,940 samples, 0.37%)</title><rect x="50.2" y="453" width="4.4" height="15.0" fill="rgb(249,202,48)" rx="2" ry="2" />
<text x="53.20" y="463.5" ></text>
</g>
<g >
<title>tag_hash (178,509,464 samples, 0.02%)</title><rect x="56.1" y="405" width="0.2" height="15.0" fill="rgb(245,185,44)" rx="2" ry="2" />
<text x="59.09" y="415.5" ></text>
</g>
<g >
<title>ItemPointerSet (6,201,621,340 samples, 0.75%)</title><rect x="683.9" y="517" width="8.8" height="15.0" fill="rgb(237,147,35)" rx="2" ry="2" />
<text x="686.86" y="527.5" ></text>
</g>
<g >
<title>GetPrivateRefCount (82,655,314 samples, 0.01%)</title><rect x="604.9" y="517" width="0.1" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="607.93" y="527.5" ></text>
</g>
<g >
<title>sys_futex (105,899,380 samples, 0.01%)</title><rect x="320.8" y="389" width="0.2" height="15.0" fill="rgb(240,164,39)" rx="2" ry="2" />
<text x="323.82" y="399.5" ></text>
</g>
<g >
<title>memcg_check_events (289,441,041 samples, 0.04%)</title><rect x="451.2" y="133" width="0.4" height="15.0" fill="rgb(206,4,1)" rx="2" ry="2" />
<text x="454.17" y="143.5" ></text>
</g>
<g >
<title>__hrtimer_run_queues (74,985,907 samples, 0.01%)</title><rect x="621.3" y="405" width="0.1" height="15.0" fill="rgb(237,150,35)" rx="2" ry="2" />
<text x="624.34" y="415.5" ></text>
</g>
<g >
<title>pgstat_progress_update_param (117,893,398 samples, 0.01%)</title><rect x="1143.8" y="549" width="0.1" height="15.0" fill="rgb(227,103,24)" rx="2" ry="2" />
<text x="1146.77" y="559.5" ></text>
</g>
<g >
<title>BackendStartup (3,096,750,624 samples, 0.38%)</title><rect x="50.2" y="709" width="4.4" height="15.0" fill="rgb(243,177,42)" rx="2" ry="2" />
<text x="53.19" y="719.5" ></text>
</g>
<g >
<title>UnlockReleaseBuffer (241,450,219 samples, 0.03%)</title><rect x="132.9" y="533" width="0.3" height="15.0" fill="rgb(215,47,11)" rx="2" ry="2" />
<text x="135.88" y="543.5" ></text>
</g>
<g >
<title>pg_atomic_sub_fetch_u32 (233,788,298 samples, 0.03%)</title><rect x="321.1" y="437" width="0.4" height="15.0" fill="rgb(242,174,41)" rx="2" ry="2" />
<text x="324.14" y="447.5" ></text>
</g>
<g >
<title>fget_light (260,561,970 samples, 0.03%)</title><rect x="340.9" y="405" width="0.3" height="15.0" fill="rgb(211,28,6)" rx="2" ry="2" />
<text x="343.87" y="415.5" ></text>
</g>
<g >
<title>__find_get_page (1,332,434,262 samples, 0.16%)</title><rect x="93.5" y="149" width="1.9" height="15.0" fill="rgb(229,114,27)" rx="2" ry="2" />
<text x="96.54" y="159.5" ></text>
</g>
<g >
<title>__list_del_entry (105,743,768 samples, 0.01%)</title><rect x="572.5" y="117" width="0.1" height="15.0" fill="rgb(214,41,9)" rx="2" ry="2" />
<text x="575.48" y="127.5" ></text>
</g>
<g >
<title>register_dirty_segment (93,260,048 samples, 0.01%)</title><rect x="19.4" y="469" width="0.1" height="15.0" fill="rgb(253,223,53)" rx="2" ry="2" />
<text x="22.39" y="479.5" ></text>
</g>
<g >
<title>path_put (75,882,312 samples, 0.01%)</title><rect x="77.7" y="389" width="0.1" height="15.0" fill="rgb(249,206,49)" rx="2" ry="2" />
<text x="80.70" y="399.5" ></text>
</g>
<g >
<title>hrtimer_interrupt (105,887,461 samples, 0.01%)</title><rect x="203.0" y="405" width="0.2" height="15.0" fill="rgb(228,109,26)" rx="2" ry="2" />
<text x="206.01" y="415.5" ></text>
</g>
<g >
<title>clear_page_c_e (120,456,993 samples, 0.01%)</title><rect x="86.5" y="165" width="0.1" height="15.0" fill="rgb(209,22,5)" rx="2" ry="2" />
<text x="89.46" y="175.5" ></text>
</g>
<g >
<title>pg_atomic_read_u32_impl (179,584,817 samples, 0.02%)</title><rect x="1120.0" y="437" width="0.2" height="15.0" fill="rgb(231,122,29)" rx="2" ry="2" />
<text x="1122.99" y="447.5" ></text>
</g>
<g >
<title>__pread_nocancel (185,374,778,929 samples, 22.49%)</title><rect x="334.6" y="453" width="265.4" height="15.0" fill="rgb(243,177,42)" rx="2" ry="2" />
<text x="337.57" y="463.5" >__pread_nocancel</text>
</g>
<g >
<title>__lru_cache_add (1,253,252,879 samples, 0.15%)</title><rect x="439.0" y="181" width="1.8" height="15.0" fill="rgb(220,70,16)" rx="2" ry="2" />
<text x="441.97" y="191.5" ></text>
</g>
<g >
<title>heap_page_prune_execute (10,657,505,639 samples, 1.29%)</title><rect x="166.1" y="501" width="15.2" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="169.05" y="511.5" ></text>
</g>
<g >
<title>StartReadBuffer (1,298,543,922 samples, 0.16%)</title><rect x="133.8" y="437" width="1.9" height="15.0" fill="rgb(222,78,18)" rx="2" ry="2" />
<text x="136.83" y="447.5" ></text>
</g>
<g >
<title>visibilitymap_pin (88,388,723 samples, 0.01%)</title><rect x="233.0" y="533" width="0.2" height="15.0" fill="rgb(253,221,53)" rx="2" ry="2" />
<text x="236.04" y="543.5" ></text>
</g>
<g >
<title>MarkBufferDirtyHint (12,348,481,497 samples, 1.50%)</title><rect x="1102.8" y="469" width="17.7" height="15.0" fill="rgb(234,136,32)" rx="2" ry="2" />
<text x="1105.78" y="479.5" ></text>
</g>
<g >
<title>rwsem_down_write_failed (77,779,055 samples, 0.01%)</title><rect x="586.5" y="213" width="0.1" height="15.0" fill="rgb(230,116,27)" rx="2" ry="2" />
<text x="589.51" y="223.5" ></text>
</g>
<g >
<title>BufTableDelete (214,760,085 samples, 0.03%)</title><rect x="19.5" y="517" width="0.3" height="15.0" fill="rgb(226,98,23)" rx="2" ry="2" />
<text x="22.53" y="527.5" ></text>
</g>
<g >
<title>BufferGetPage (161,013,276 samples, 0.02%)</title><rect x="1138.3" y="517" width="0.3" height="15.0" fill="rgb(253,220,52)" rx="2" ry="2" />
<text x="1141.35" y="527.5" ></text>
</g>
<g >
<title>retint_userspace_restore_args (224,533,485 samples, 0.03%)</title><rect x="257.1" y="421" width="0.3" height="15.0" fill="rgb(215,46,11)" rx="2" ry="2" />
<text x="260.13" y="431.5" ></text>
</g>
<g >
<title>call_rwsem_wake (177,494,685 samples, 0.02%)</title><rect x="596.8" y="309" width="0.3" height="15.0" fill="rgb(231,119,28)" rx="2" ry="2" />
<text x="599.80" y="319.5" ></text>
</g>
<g >
<title>pg_atomic_read_u32_impl (136,246,566 samples, 0.02%)</title><rect x="1120.2" y="453" width="0.2" height="15.0" fill="rgb(231,122,29)" rx="2" ry="2" />
<text x="1123.25" y="463.5" ></text>
</g>
<g >
<title>handle_mm_fault (142,095,976,800 samples, 17.24%)</title><rect x="382.1" y="261" width="203.4" height="15.0" fill="rgb(234,135,32)" rx="2" ry="2" />
<text x="385.10" y="271.5" >handle_mm_fault</text>
</g>
<g >
<title>unmap_page_range (10,140,091,964 samples, 1.23%)</title><rect x="1146.3" y="645" width="14.6" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="1149.35" y="655.5" ></text>
</g>
<g >
<title>__hrtimer_run_queues (126,685,162 samples, 0.02%)</title><rect x="990.7" y="421" width="0.2" height="15.0" fill="rgb(237,150,35)" rx="2" ry="2" />
<text x="993.75" y="431.5" ></text>
</g>
<g >
<title>pg_rotate_left32 (89,420,642 samples, 0.01%)</title><rect x="238.4" y="389" width="0.1" height="15.0" fill="rgb(205,1,0)" rx="2" ry="2" />
<text x="241.37" y="399.5" ></text>
</g>
<g >
<title>page_remove_rmap (2,012,295,836 samples, 0.24%)</title><rect x="1154.6" y="629" width="2.8" height="15.0" fill="rgb(252,219,52)" rx="2" ry="2" />
<text x="1157.57" y="639.5" ></text>
</g>
<g >
<title>__find_get_page (89,010,712 samples, 0.01%)</title><rect x="13.9" y="245" width="0.1" height="15.0" fill="rgb(229,114,27)" rx="2" ry="2" />
<text x="16.87" y="255.5" ></text>
</g>
<g >
<title>LWLockAttemptLock (1,061,775,149 samples, 0.13%)</title><rect x="616.8" y="373" width="1.5" height="15.0" fill="rgb(235,138,33)" rx="2" ry="2" />
<text x="619.81" y="383.5" ></text>
</g>
<g >
<title>__find_lock_page (461,878,109 samples, 0.06%)</title><rect x="328.0" y="325" width="0.6" height="15.0" fill="rgb(251,214,51)" rx="2" ry="2" />
<text x="330.96" y="335.5" ></text>
</g>
<g >
<title>__do_fault.isra.61 (609,936,763 samples, 0.07%)</title><rect x="327.7" y="373" width="0.9" height="15.0" fill="rgb(227,102,24)" rx="2" ry="2" />
<text x="330.75" y="383.5" ></text>
</g>
<g >
<title>scheduler_tick (177,710,001 samples, 0.02%)</title><rect x="784.6" y="389" width="0.2" height="15.0" fill="rgb(246,190,45)" rx="2" ry="2" />
<text x="787.56" y="399.5" ></text>
</g>
<g >
<title>native_queued_spin_lock_slowpath (200,576,965 samples, 0.02%)</title><rect x="1159.0" y="517" width="0.3" height="15.0" fill="rgb(238,153,36)" rx="2" ry="2" />
<text x="1162.04" y="527.5" ></text>
</g>
<g >
<title>update_process_times (96,190,387 samples, 0.01%)</title><rect x="990.8" y="373" width="0.1" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="993.79" y="383.5" ></text>
</g>
<g >
<title>sys_pread64 (181,223,138,798 samples, 21.99%)</title><rect x="340.5" y="421" width="259.5" height="15.0" fill="rgb(212,35,8)" rx="2" ry="2" />
<text x="343.50" y="431.5" >sys_pread64</text>
</g>
<g >
<title>startup_hacks (128,045,675,684 samples, 15.54%)</title><rect x="50.2" y="757" width="183.3" height="15.0" fill="rgb(243,178,42)" rx="2" ry="2" />
<text x="53.19" y="767.5" >startup_hacks</text>
</g>
<g >
<title>pg_atomic_read_u32 (99,054,752 samples, 0.01%)</title><rect x="23.5" y="453" width="0.2" height="15.0" fill="rgb(248,202,48)" rx="2" ry="2" />
<text x="26.54" y="463.5" ></text>
</g>
<g >
<title>__do_page_fault (323,802,914 samples, 0.04%)</title><rect x="57.2" y="357" width="0.5" height="15.0" fill="rgb(239,158,37)" rx="2" ry="2" />
<text x="60.20" y="367.5" ></text>
</g>
<g >
<title>compactify_tuples (17,388,591,782 samples, 2.11%)</title><rect x="829.1" y="485" width="24.9" height="15.0" fill="rgb(209,21,5)" rx="2" ry="2" />
<text x="832.11" y="495.5" >c..</text>
</g>
<g >
<title>__libc_start_main (495,367,615 samples, 0.06%)</title><rect x="10.3" y="773" width="0.7" height="15.0" fill="rgb(236,142,34)" rx="2" ry="2" />
<text x="13.27" y="783.5" ></text>
</g>
<g >
<title>StartReadBuffersImpl (1,284,090,552 samples, 0.16%)</title><rect x="133.9" y="421" width="1.8" height="15.0" fill="rgb(232,125,30)" rx="2" ry="2" />
<text x="136.85" y="431.5" ></text>
</g>
<g >
<title>radix_tree_descend (166,031,505 samples, 0.02%)</title><rect x="270.9" y="229" width="0.3" height="15.0" fill="rgb(243,175,41)" rx="2" ry="2" />
<text x="273.93" y="239.5" ></text>
</g>
<g >
<title>page_fault (75,077,316 samples, 0.01%)</title><rect x="11.8" y="517" width="0.1" height="15.0" fill="rgb(243,177,42)" rx="2" ry="2" />
<text x="14.78" y="527.5" ></text>
</g>
<g >
<title>heap_prune_record_unchanged_lp_normal (565,069,961 samples, 0.07%)</title><rect x="53.5" y="373" width="0.8" height="15.0" fill="rgb(221,76,18)" rx="2" ry="2" />
<text x="56.46" y="383.5" ></text>
</g>
<g >
<title>page_fault (3,497,024,377 samples, 0.42%)</title><rect x="266.9" y="421" width="5.0" height="15.0" fill="rgb(243,177,42)" rx="2" ry="2" />
<text x="269.86" y="431.5" ></text>
</g>
<g >
<title>local_apic_timer_interrupt (151,872,239 samples, 0.02%)</title><rect x="393.6" y="165" width="0.2" height="15.0" fill="rgb(213,37,9)" rx="2" ry="2" />
<text x="396.57" y="175.5" ></text>
</g>
<g >
<title>sys_futex (109,531,290 samples, 0.01%)</title><rect x="73.9" y="373" width="0.2" height="15.0" fill="rgb(240,164,39)" rx="2" ry="2" />
<text x="76.93" y="383.5" ></text>
</g>
<g >
<title>queued_spin_lock_slowpath (72,546,724,437 samples, 8.80%)</title><rect x="465.7" y="149" width="103.9" height="15.0" fill="rgb(231,122,29)" rx="2" ry="2" />
<text x="468.73" y="159.5" >queued_spin_..</text>
</g>
<g >
<title>[perf] (424,046,642 samples, 0.05%)</title><rect x="10.3" y="709" width="0.6" height="15.0" fill="rgb(253,223,53)" rx="2" ry="2" />
<text x="13.27" y="719.5" ></text>
</g>
<g >
<title>PageIsNew (82,558,643 samples, 0.01%)</title><rect x="22.6" y="597" width="0.2" height="15.0" fill="rgb(212,35,8)" rx="2" ry="2" />
<text x="25.63" y="607.5" ></text>
</g>
<g >
<title>hash_search_with_hash_value (87,695,082 samples, 0.01%)</title><rect x="23.4" y="469" width="0.1" height="15.0" fill="rgb(249,205,49)" rx="2" ry="2" />
<text x="26.40" y="479.5" ></text>
</g>
<g >
<title>__block_write_begin_int (101,153,593 samples, 0.01%)</title><rect x="13.7" y="277" width="0.2" height="15.0" fill="rgb(253,222,53)" rx="2" ry="2" />
<text x="16.71" y="287.5" ></text>
</g>
<g >
<title>LWLockAcquire (113,831,556 samples, 0.01%)</title><rect x="23.5" y="485" width="0.2" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="26.53" y="495.5" ></text>
</g>
<g >
<title>radix_tree_descend (2,324,884,247 samples, 0.28%)</title><rect x="432.2" y="117" width="3.4" height="15.0" fill="rgb(243,175,41)" rx="2" ry="2" />
<text x="435.24" y="127.5" ></text>
</g>
<g >
<title>BufferIsPermanent (1,355,635,527 samples, 0.16%)</title><rect x="223.7" y="453" width="2.0" height="15.0" fill="rgb(250,210,50)" rx="2" ry="2" />
<text x="226.73" y="463.5" ></text>
</g>
<g >
<title>xfs_file_aio_write (301,734,021 samples, 0.04%)</title><rect x="10.4" y="581" width="0.5" height="15.0" fill="rgb(251,211,50)" rx="2" ry="2" />
<text x="13.43" y="591.5" ></text>
</g>
<g >
<title>clockevents_program_event (70,291,015 samples, 0.01%)</title><rect x="1167.0" y="613" width="0.1" height="15.0" fill="rgb(244,182,43)" rx="2" ry="2" />
<text x="1169.95" y="623.5" ></text>
</g>
<g >
<title>GetPrivateRefCountEntry (94,565,604 samples, 0.01%)</title><rect x="631.4" y="517" width="0.1" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="634.37" y="527.5" ></text>
</g>
<g >
<title>__radix_tree_insert (943,612,261 samples, 0.11%)</title><rect x="99.5" y="149" width="1.3" height="15.0" fill="rgb(235,140,33)" rx="2" ry="2" />
<text x="102.46" y="159.5" ></text>
</g>
<g >
<title>do_read_fault.isra.63 (304,123,019 samples, 0.04%)</title><rect x="63.9" y="341" width="0.5" height="15.0" fill="rgb(216,52,12)" rx="2" ry="2" />
<text x="66.92" y="351.5" ></text>
</g>
<g >
<title>start_secondary (18,725,326,849 samples, 2.27%)</title><rect x="1162.6" y="757" width="26.9" height="15.0" fill="rgb(242,170,40)" rx="2" ry="2" />
<text x="1165.65" y="767.5" >s..</text>
</g>
<g >
<title>HeapTupleSatisfiesVacuum (291,239,335 samples, 0.04%)</title><rect x="25.3" y="613" width="0.4" height="15.0" fill="rgb(220,71,17)" rx="2" ry="2" />
<text x="28.31" y="623.5" ></text>
</g>
<g >
<title>wake_up_process (744,716,072 samples, 0.09%)</title><rect x="1165.6" y="597" width="1.1" height="15.0" fill="rgb(213,37,8)" rx="2" ry="2" />
<text x="1168.59" y="607.5" ></text>
</g>
<g >
<title>system_call_after_swapgs (89,443,719 samples, 0.01%)</title><rect x="43.4" y="677" width="0.1" height="15.0" fill="rgb(243,179,42)" rx="2" ry="2" />
<text x="46.36" y="687.5" ></text>
</g>
<g >
<title>GetBufferDescriptor (149,671,047 samples, 0.02%)</title><rect x="227.5" y="437" width="0.2" height="15.0" fill="rgb(249,202,48)" rx="2" ry="2" />
<text x="230.47" y="447.5" ></text>
</g>
<g >
<title>ResourceOwnerForget (307,334,323 samples, 0.04%)</title><rect x="330.1" y="469" width="0.4" height="15.0" fill="rgb(235,142,33)" rx="2" ry="2" />
<text x="333.11" y="479.5" ></text>
</g>
<g >
<title>page_fault (98,350,942 samples, 0.01%)</title><rect x="310.0" y="373" width="0.2" height="15.0" fill="rgb(243,177,42)" rx="2" ry="2" />
<text x="313.04" y="383.5" ></text>
</g>
<g >
<title>BufferGetPage (174,057,608 samples, 0.02%)</title><rect x="233.7" y="565" width="0.2" height="15.0" fill="rgb(253,220,52)" rx="2" ry="2" />
<text x="236.67" y="575.5" ></text>
</g>
<g >
<title>smgrreadv (38,868,154,468 samples, 4.72%)</title><rect x="76.7" y="485" width="55.6" height="15.0" fill="rgb(240,165,39)" rx="2" ry="2" />
<text x="79.69" y="495.5" >smgrr..</text>
</g>
<g >
<title>__find_lock_page (1,338,191,759 samples, 0.16%)</title><rect x="93.5" y="165" width="1.9" height="15.0" fill="rgb(251,214,51)" rx="2" ry="2" />
<text x="96.53" y="175.5" ></text>
</g>
<g >
<title>tick_sched_handle (96,945,934 samples, 0.01%)</title><rect x="990.8" y="389" width="0.1" height="15.0" fill="rgb(219,68,16)" rx="2" ry="2" />
<text x="993.79" y="399.5" ></text>
</g>
<g >
<title>TransactionIdGetCommitLSN (261,442,926 samples, 0.03%)</title><rect x="1120.5" y="469" width="0.3" height="15.0" fill="rgb(238,152,36)" rx="2" ry="2" />
<text x="1123.45" y="479.5" ></text>
</g>
<g >
<title>native_queued_spin_lock_slowpath (376,838,370 samples, 0.05%)</title><rect x="1164.8" y="453" width="0.5" height="15.0" fill="rgb(238,153,36)" rx="2" ry="2" />
<text x="1167.76" y="463.5" ></text>
</g>
<g >
<title>ServerLoop (637,279,068,599 samples, 77.32%)</title><rect x="233.6" y="741" width="912.4" height="15.0" fill="rgb(238,155,37)" rx="2" ry="2" />
<text x="236.61" y="751.5" >ServerLoop</text>
</g>
<g >
<title>xfs_iunlock (1,452,303,115 samples, 0.18%)</title><rect x="17.2" y="341" width="2.1" height="15.0" fill="rgb(232,127,30)" rx="2" ry="2" />
<text x="20.20" y="351.5" ></text>
</g>
<g >
<title>update_process_times (75,893,920 samples, 0.01%)</title><rect x="173.3" y="341" width="0.2" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="176.34" y="351.5" ></text>
</g>
<g >
<title>parallel_vacuum_main (124,426,515,195 samples, 15.10%)</title><rect x="55.2" y="613" width="178.1" height="15.0" fill="rgb(213,40,9)" rx="2" ry="2" />
<text x="58.19" y="623.5" >parallel_vacuum_main</text>
</g>
<g >
<title>MarkBufferDirty (184,363,082 samples, 0.02%)</title><rect x="163.1" y="501" width="0.3" height="15.0" fill="rgb(238,152,36)" rx="2" ry="2" />
<text x="166.09" y="511.5" ></text>
</g>
<g >
<title>BufferIsValid (148,150,867 samples, 0.02%)</title><rect x="228.3" y="405" width="0.2" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="231.29" y="415.5" ></text>
</g>
<g >
<title>shmem_getpage_gfp (25,879,854,090 samples, 3.14%)</title><rect x="90.6" y="181" width="37.1" height="15.0" fill="rgb(227,105,25)" rx="2" ry="2" />
<text x="93.61" y="191.5" >shm..</text>
</g>
<g >
<title>pg_rotate_left32 (207,721,547 samples, 0.03%)</title><rect x="615.0" y="341" width="0.2" height="15.0" fill="rgb(205,1,0)" rx="2" ry="2" />
<text x="617.95" y="351.5" ></text>
</g>
<g >
<title>try_to_wake_up (987,409,050 samples, 0.12%)</title><rect x="17.9" y="261" width="1.4" height="15.0" fill="rgb(220,70,16)" rx="2" ry="2" />
<text x="20.86" y="271.5" ></text>
</g>
<g >
<title>ReadBuffer_common (1,309,935,066 samples, 0.16%)</title><rect x="133.8" y="453" width="1.9" height="15.0" fill="rgb(213,40,9)" rx="2" ry="2" />
<text x="136.82" y="463.5" ></text>
</g>
<g >
<title>deactivate_task (196,962,692 samples, 0.02%)</title><rect x="130.7" y="229" width="0.3" height="15.0" fill="rgb(206,8,2)" rx="2" ry="2" />
<text x="133.73" y="239.5" ></text>
</g>
<g >
<title>iomap_file_buffered_write (271,395,752 samples, 0.03%)</title><rect x="10.4" y="549" width="0.4" height="15.0" fill="rgb(206,6,1)" rx="2" ry="2" />
<text x="13.44" y="559.5" ></text>
</g>
<g >
<title>BufferIsValid (86,736,907 samples, 0.01%)</title><rect x="606.7" y="517" width="0.1" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="609.72" y="527.5" ></text>
</g>
<g >
<title>clockevents_program_event (287,736,488 samples, 0.03%)</title><rect x="1185.3" y="613" width="0.4" height="15.0" fill="rgb(244,182,43)" rx="2" ry="2" />
<text x="1188.26" y="623.5" ></text>
</g>
<g >
<title>update_time (618,631,583 samples, 0.08%)</title><rect x="14.9" y="309" width="0.9" height="15.0" fill="rgb(211,31,7)" rx="2" ry="2" />
<text x="17.88" y="319.5" ></text>
</g>
<g >
<title>pg_atomic_read_u32_impl (399,479,590 samples, 0.05%)</title><rect x="1098.9" y="437" width="0.5" height="15.0" fill="rgb(231,122,29)" rx="2" ry="2" />
<text x="1101.86" y="447.5" ></text>
</g>
<g >
<title>pg_atomic_compare_exchange_u32_impl (78,077,511 samples, 0.01%)</title><rect x="235.5" y="469" width="0.1" height="15.0" fill="rgb(235,141,33)" rx="2" ry="2" />
<text x="238.51" y="479.5" ></text>
</g>
<g >
<title>tick_nohz_idle_enter (2,419,701,390 samples, 0.29%)</title><rect x="1183.3" y="725" width="3.5" height="15.0" fill="rgb(250,211,50)" rx="2" ry="2" />
<text x="1186.31" y="735.5" ></text>
</g>
<g >
<title>BufferIsValid (358,459,423 samples, 0.04%)</title><rect x="1096.9" y="421" width="0.5" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="1099.85" y="431.5" ></text>
</g>
<g >
<title>shmem_fault (78,256,442 samples, 0.01%)</title><rect x="75.5" y="341" width="0.1" height="15.0" fill="rgb(236,143,34)" rx="2" ry="2" />
<text x="78.50" y="351.5" ></text>
</g>
<g >
<title>pg_atomic_read_u32 (78,424,895 samples, 0.01%)</title><rect x="235.7" y="485" width="0.1" height="15.0" fill="rgb(248,202,48)" rx="2" ry="2" />
<text x="238.65" y="495.5" ></text>
</g>
<g >
<title>do_lazy_scan_heap (3,079,689,901 samples, 0.37%)</title><rect x="50.2" y="437" width="4.4" height="15.0" fill="rgb(221,75,18)" rx="2" ry="2" />
<text x="53.21" y="447.5" ></text>
</g>
<g >
<title>__hrtimer_run_queues (74,441,753 samples, 0.01%)</title><rect x="87.4" y="117" width="0.1" height="15.0" fill="rgb(237,150,35)" rx="2" ry="2" />
<text x="90.38" y="127.5" ></text>
</g>
<g >
<title>MarkBufferDirty (1,113,157,223 samples, 0.14%)</title><rect x="630.2" y="533" width="1.6" height="15.0" fill="rgb(238,152,36)" rx="2" ry="2" />
<text x="633.20" y="543.5" ></text>
</g>
<g >
<title>pg_atomic_read_u32_impl (916,327,583 samples, 0.11%)</title><rect x="602.5" y="453" width="1.4" height="15.0" fill="rgb(231,122,29)" rx="2" ry="2" />
<text x="605.55" y="463.5" ></text>
</g>
<g >
<title>pg_atomic_fetch_or_u32_impl (194,896,887 samples, 0.02%)</title><rect x="329.8" y="453" width="0.2" height="15.0" fill="rgb(253,224,53)" rx="2" ry="2" />
<text x="332.76" y="463.5" ></text>
</g>
<g >
<title>HeapTupleHeaderAdvanceConflictHorizon (1,221,295,602 samples, 0.15%)</title><rect x="189.5" y="485" width="1.7" height="15.0" fill="rgb(240,164,39)" rx="2" ry="2" />
<text x="192.49" y="495.5" ></text>
</g>
<g >
<title>get_hash_entry (5,600,904,123 samples, 0.68%)</title><rect x="240.7" y="421" width="8.0" height="15.0" fill="rgb(225,93,22)" rx="2" ry="2" />
<text x="243.67" y="431.5" ></text>
</g>
<g >
<title>TransactionIdIsInProgress (201,566,214 samples, 0.02%)</title><rect x="229.7" y="469" width="0.3" height="15.0" fill="rgb(208,16,3)" rx="2" ry="2" />
<text x="232.74" y="479.5" ></text>
</g>
<g >
<title>ResourceOwnerForget (102,780,829 samples, 0.01%)</title><rect x="607.1" y="469" width="0.2" height="15.0" fill="rgb(235,142,33)" rx="2" ry="2" />
<text x="610.12" y="479.5" ></text>
</g>
<g >
<title>__list_add (74,140,569 samples, 0.01%)</title><rect x="570.6" y="133" width="0.1" height="15.0" fill="rgb(235,141,33)" rx="2" ry="2" />
<text x="573.60" y="143.5" ></text>
</g>
<g >
<title>TransactionIdDidCommit (686,273,186 samples, 0.08%)</title><rect x="1122.1" y="485" width="1.0" height="15.0" fill="rgb(216,51,12)" rx="2" ry="2" />
<text x="1125.10" y="495.5" ></text>
</g>
<g >
<title>native_queued_spin_lock_slowpath (107,024,501 samples, 0.01%)</title><rect x="130.2" y="229" width="0.2" height="15.0" fill="rgb(238,153,36)" rx="2" ry="2" />
<text x="133.20" y="239.5" ></text>
</g>
<g >
<title>BufTableLookup (89,512,577 samples, 0.01%)</title><rect x="23.4" y="485" width="0.1" height="15.0" fill="rgb(224,89,21)" rx="2" ry="2" />
<text x="26.40" y="495.5" ></text>
</g>
<g >
<title>TransactionIdPrecedes (82,446,602 samples, 0.01%)</title><rect x="229.9" y="453" width="0.1" height="15.0" fill="rgb(226,98,23)" rx="2" ry="2" />
<text x="232.91" y="463.5" ></text>
</g>
<g >
<title>TransactionIdFollows (96,453,919 samples, 0.01%)</title><rect x="26.5" y="613" width="0.1" height="15.0" fill="rgb(222,79,18)" rx="2" ry="2" />
<text x="29.47" y="623.5" ></text>
</g>
<g >
<title>retint_swapgs (87,849,845 samples, 0.01%)</title><rect x="246.7" y="405" width="0.1" height="15.0" fill="rgb(253,222,53)" rx="2" ry="2" />
<text x="249.70" y="415.5" ></text>
</g>
<g >
<title>__radix_tree_create (2,480,859,096 samples, 0.30%)</title><rect x="456.9" y="149" width="3.6" height="15.0" fill="rgb(248,201,48)" rx="2" ry="2" />
<text x="459.90" y="159.5" ></text>
</g>
<g >
<title>xfs_trans_ijoin (84,314,857 samples, 0.01%)</title><rect x="15.6" y="277" width="0.2" height="15.0" fill="rgb(224,89,21)" rx="2" ry="2" />
<text x="18.64" y="287.5" ></text>
</g>
<g >
<title>GetVictimBuffer (5,804,088,260 samples, 0.70%)</title><rect x="11.9" y="549" width="8.3" height="15.0" fill="rgb(209,18,4)" rx="2" ry="2" />
<text x="14.92" y="559.5" ></text>
</g>
<g >
<title>lru_add_drain_cpu (231,948,564 samples, 0.03%)</title><rect x="1159.0" y="581" width="0.3" height="15.0" fill="rgb(247,194,46)" rx="2" ry="2" />
<text x="1162.01" y="591.5" ></text>
</g>
<g >
<title>_raw_spin_lock_irqsave (200,576,965 samples, 0.02%)</title><rect x="1159.0" y="549" width="0.3" height="15.0" fill="rgb(247,195,46)" rx="2" ry="2" />
<text x="1162.04" y="559.5" ></text>
</g>
<g >
<title>local_apic_timer_interrupt (134,744,497 samples, 0.02%)</title><rect x="853.8" y="437" width="0.2" height="15.0" fill="rgb(213,37,9)" rx="2" ry="2" />
<text x="856.81" y="447.5" ></text>
</g>
<g >
<title>__writeback_inodes_wb (115,471,936 samples, 0.01%)</title><rect x="10.1" y="677" width="0.2" height="15.0" fill="rgb(234,133,32)" rx="2" ry="2" />
<text x="13.10" y="687.5" ></text>
</g>
<g >
<title>GetPrivateRefCount (585,022,069 samples, 0.07%)</title><rect x="224.4" y="437" width="0.9" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="227.43" y="447.5" ></text>
</g>
<g >
<title>__hrtimer_run_queues (202,480,015 samples, 0.02%)</title><rect x="933.5" y="437" width="0.3" height="15.0" fill="rgb(237,150,35)" rx="2" ry="2" />
<text x="936.47" y="447.5" ></text>
</g>
<g >
<title>pg_atomic_fetch_add_u32 (179,345,556 samples, 0.02%)</title><rect x="65.1" y="405" width="0.3" height="15.0" fill="rgb(206,4,1)" rx="2" ry="2" />
<text x="68.14" y="415.5" ></text>
</g>
<g >
<title>pg_atomic_read_u32 (927,468,592 samples, 0.11%)</title><rect x="602.5" y="469" width="1.4" height="15.0" fill="rgb(248,202,48)" rx="2" ry="2" />
<text x="605.53" y="479.5" ></text>
</g>
<g >
<title>BufTableInsert (7,254,884,841 samples, 0.88%)</title><rect x="238.6" y="453" width="10.4" height="15.0" fill="rgb(206,8,1)" rx="2" ry="2" />
<text x="241.58" y="463.5" ></text>
</g>
<g >
<title>__find_get_page (414,148,384 samples, 0.05%)</title><rect x="328.0" y="309" width="0.6" height="15.0" fill="rgb(229,114,27)" rx="2" ry="2" />
<text x="330.97" y="319.5" ></text>
</g>
<g >
<title>_raw_qspin_lock_irq (74,745,830,129 samples, 9.07%)</title><rect x="462.6" y="165" width="107.0" height="15.0" fill="rgb(251,214,51)" rx="2" ry="2" />
<text x="465.58" y="175.5" >_raw_qspin_lo..</text>
</g>
<g >
<title>smp_apic_timer_interrupt (89,760,923 samples, 0.01%)</title><rect x="301.0" y="389" width="0.1" height="15.0" fill="rgb(221,74,17)" rx="2" ry="2" />
<text x="304.00" y="399.5" ></text>
</g>
<g >
<title>iput (96,801,624 samples, 0.01%)</title><rect x="48.4" y="661" width="0.2" height="15.0" fill="rgb(206,7,1)" rx="2" ry="2" />
<text x="51.43" y="671.5" ></text>
</g>
<g >
<title>__hrtimer_run_queues (111,981,977 samples, 0.01%)</title><rect x="1086.9" y="405" width="0.2" height="15.0" fill="rgb(237,150,35)" rx="2" ry="2" />
<text x="1089.95" y="415.5" ></text>
</g>
<g >
<title>PageRepairFragmentation (8,308,151,997 samples, 1.01%)</title><rect x="167.3" y="485" width="11.9" height="15.0" fill="rgb(226,98,23)" rx="2" ry="2" />
<text x="170.26" y="495.5" ></text>
</g>
<g >
<title>vfs_read (861,773,929 samples, 0.10%)</title><rect x="21.2" y="501" width="1.2" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="24.19" y="511.5" ></text>
</g>
<g >
<title>LWLockWaitListLock (133,068,234 samples, 0.02%)</title><rect x="604.1" y="469" width="0.2" height="15.0" fill="rgb(205,1,0)" rx="2" ry="2" />
<text x="607.14" y="479.5" ></text>
</g>
<g >
<title>pg_atomic_read_u32 (74,369,605 samples, 0.01%)</title><rect x="20.2" y="517" width="0.2" height="15.0" fill="rgb(248,202,48)" rx="2" ry="2" />
<text x="23.25" y="527.5" ></text>
</g>
<g >
<title>rwsem_down_write_failed (832,917,593 samples, 0.10%)</title><rect x="16.0" y="293" width="1.2" height="15.0" fill="rgb(230,116,27)" rx="2" ry="2" />
<text x="19.01" y="303.5" ></text>
</g>
<g >
<title>__hrtimer_run_queues (137,640,001 samples, 0.02%)</title><rect x="393.6" y="133" width="0.2" height="15.0" fill="rgb(237,150,35)" rx="2" ry="2" />
<text x="396.58" y="143.5" ></text>
</g>
<g >
<title>pgstat_progress_update_param (159,560,102 samples, 0.02%)</title><rect x="1145.5" y="565" width="0.2" height="15.0" fill="rgb(227,103,24)" rx="2" ry="2" />
<text x="1148.48" y="575.5" ></text>
</g>
<g >
<title>hash_bytes (144,268,336 samples, 0.02%)</title><rect x="613.6" y="357" width="0.2" height="15.0" fill="rgb(227,102,24)" rx="2" ry="2" />
<text x="616.56" y="367.5" ></text>
</g>
<g >
<title>__list_add (78,715,736 samples, 0.01%)</title><rect x="591.5" y="277" width="0.1" height="15.0" fill="rgb(235,141,33)" rx="2" ry="2" />
<text x="594.48" y="287.5" ></text>
</g>
<g >
<title>PinBufferForBlock (59,826,216,177 samples, 7.26%)</title><rect x="237.5" y="485" width="85.6" height="15.0" fill="rgb(241,168,40)" rx="2" ry="2" />
<text x="240.45" y="495.5" >PinBufferF..</text>
</g>
<g >
<title>radix_tree_maybe_preload (118,390,622 samples, 0.01%)</title><rect x="452.3" y="181" width="0.2" height="15.0" fill="rgb(221,76,18)" rx="2" ry="2" />
<text x="455.35" y="191.5" ></text>
</g>
<g >
<title>update_process_times (76,404,427 samples, 0.01%)</title><rect x="933.7" y="389" width="0.1" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="936.65" y="399.5" ></text>
</g>
<g >
<title>GetPrivateRefCount (89,160,177 samples, 0.01%)</title><rect x="1143.0" y="501" width="0.1" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="1146.01" y="511.5" ></text>
</g>
<g >
<title>pg_atomic_read_u32 (77,677,794 samples, 0.01%)</title><rect x="73.6" y="405" width="0.1" height="15.0" fill="rgb(248,202,48)" rx="2" ry="2" />
<text x="76.60" y="415.5" ></text>
</g>
<g >
<title>ItemPointerSet (5,252,318,095 samples, 0.64%)</title><rect x="761.9" y="517" width="7.5" height="15.0" fill="rgb(237,147,35)" rx="2" ry="2" />
<text x="764.92" y="527.5" ></text>
</g>
<g >
<title>page_fault (1,186,716,279 samples, 0.14%)</title><rect x="45.1" y="725" width="1.7" height="15.0" fill="rgb(243,177,42)" rx="2" ry="2" />
<text x="48.10" y="735.5" ></text>
</g>
<g >
<title>heap_prune_record_unchanged_lp_normal (896,865,081 samples, 0.11%)</title><rect x="1063.0" y="517" width="1.3" height="15.0" fill="rgb(221,76,18)" rx="2" ry="2" />
<text x="1065.98" y="527.5" ></text>
</g>
<g >
<title>page_add_file_rmap (289,036,469 samples, 0.04%)</title><rect x="578.9" y="213" width="0.4" height="15.0" fill="rgb(207,13,3)" rx="2" ry="2" />
<text x="581.93" y="223.5" ></text>
</g>
<g >
<title>rebalance_domains (140,183,602 samples, 0.02%)</title><rect x="1189.7" y="517" width="0.2" height="15.0" fill="rgb(248,202,48)" rx="2" ry="2" />
<text x="1192.70" y="527.5" ></text>
</g>
<g >
<title>system_call_fastpath (129,399,723 samples, 0.02%)</title><rect x="51.7" y="309" width="0.2" height="15.0" fill="rgb(252,217,52)" rx="2" ry="2" />
<text x="54.71" y="319.5" ></text>
</g>
<g >
<title>shmem_fault (760,194,814 samples, 0.09%)</title><rect x="59.6" y="309" width="1.1" height="15.0" fill="rgb(236,143,34)" rx="2" ry="2" />
<text x="62.64" y="319.5" ></text>
</g>
<g >
<title>BufferIsValid (1,258,646,007 samples, 0.15%)</title><rect x="1091.4" y="453" width="1.8" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="1094.43" y="463.5" ></text>
</g>
<g >
<title>PageGetItemId (359,397,888 samples, 0.04%)</title><rect x="166.7" y="485" width="0.5" height="15.0" fill="rgb(246,192,46)" rx="2" ry="2" />
<text x="169.73" y="495.5" ></text>
</g>
<g >
<title>pg_atomic_compare_exchange_u32 (545,736,451 samples, 0.07%)</title><rect x="1140.0" y="469" width="0.8" height="15.0" fill="rgb(253,220,52)" rx="2" ry="2" />
<text x="1143.05" y="479.5" ></text>
</g>
<g >
<title>pg_atomic_compare_exchange_u32 (190,081,086 samples, 0.02%)</title><rect x="770.3" y="501" width="0.3" height="15.0" fill="rgb(253,220,52)" rx="2" ry="2" />
<text x="773.31" y="511.5" ></text>
</g>
<g >
<title>shmem_getpage_gfp (72,870,510 samples, 0.01%)</title><rect x="75.5" y="325" width="0.1" height="15.0" fill="rgb(227,105,25)" rx="2" ry="2" />
<text x="78.51" y="335.5" ></text>
</g>
<g >
<title>do_sync_write (302,519,142 samples, 0.04%)</title><rect x="10.4" y="597" width="0.5" height="15.0" fill="rgb(213,37,9)" rx="2" ry="2" />
<text x="13.43" y="607.5" ></text>
</g>
<g >
<title>GetVictimBuffer (8,376,972,519 samples, 1.02%)</title><rect x="61.5" y="437" width="12.0" height="15.0" fill="rgb(209,18,4)" rx="2" ry="2" />
<text x="64.47" y="447.5" ></text>
</g>
<g >
<title>do_parallel_lazy_scan_heap (22,573,331,038 samples, 2.74%)</title><rect x="11.0" y="677" width="32.3" height="15.0" fill="rgb(249,202,48)" rx="2" ry="2" />
<text x="14.02" y="687.5" >do..</text>
</g>
<g >
<title>tick_do_update_jiffies64 (94,199,503 samples, 0.01%)</title><rect x="1183.2" y="725" width="0.1" height="15.0" fill="rgb(208,14,3)" rx="2" ry="2" />
<text x="1186.17" y="735.5" ></text>
</g>
<g >
<title>main (495,367,615 samples, 0.06%)</title><rect x="10.3" y="757" width="0.7" height="15.0" fill="rgb(243,179,42)" rx="2" ry="2" />
<text x="13.27" y="767.5" ></text>
</g>
<g >
<title>GetPrivateRefCountEntry (271,714,266 samples, 0.03%)</title><rect x="236.0" y="517" width="0.3" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="238.95" y="527.5" ></text>
</g>
<g >
<title>LWLockAcquire (1,441,347,497 samples, 0.17%)</title><rect x="318.5" y="453" width="2.0" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="321.46" y="463.5" ></text>
</g>
<g >
<title>pick_next_task_fair (116,656,498 samples, 0.01%)</title><rect x="311.2" y="277" width="0.1" height="15.0" fill="rgb(242,170,40)" rx="2" ry="2" />
<text x="314.17" y="287.5" ></text>
</g>
<g >
<title>dsa_get_total_size (118,829,265 samples, 0.01%)</title><rect x="601.5" y="533" width="0.1" height="15.0" fill="rgb(238,152,36)" rx="2" ry="2" />
<text x="604.46" y="543.5" ></text>
</g>
<g >
<title>heap_tuple_should_freeze (17,407,587,303 samples, 2.11%)</title><rect x="1034.1" y="469" width="24.9" height="15.0" fill="rgb(247,194,46)" rx="2" ry="2" />
<text x="1037.12" y="479.5" >h..</text>
</g>
<g >
<title>pg_atomic_read_u32_impl (80,991,435 samples, 0.01%)</title><rect x="225.5" y="421" width="0.1" height="15.0" fill="rgb(231,122,29)" rx="2" ry="2" />
<text x="228.47" y="431.5" ></text>
</g>
<g >
<title>mem_cgroup_page_lruvec (162,476,310 samples, 0.02%)</title><rect x="440.2" y="149" width="0.2" height="15.0" fill="rgb(212,32,7)" rx="2" ry="2" />
<text x="443.20" y="159.5" ></text>
</g>
<g >
<title>hash_search (84,772,866 samples, 0.01%)</title><rect x="12.1" y="501" width="0.2" height="15.0" fill="rgb(216,55,13)" rx="2" ry="2" />
<text x="15.13" y="511.5" ></text>
</g>
<g >
<title>heap_prune_chain (5,056,190,885 samples, 0.61%)</title><rect x="32.8" y="613" width="7.2" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="35.78" y="623.5" ></text>
</g>
<g >
<title>tick_program_event (220,321,226 samples, 0.03%)</title><rect x="1187.8" y="677" width="0.3" height="15.0" fill="rgb(241,166,39)" rx="2" ry="2" />
<text x="1190.77" y="687.5" ></text>
</g>
<g >
<title>ParallelWorkerMain (124,784,586,574 samples, 15.14%)</title><rect x="54.7" y="629" width="178.6" height="15.0" fill="rgb(253,221,53)" rx="2" ry="2" />
<text x="57.67" y="639.5" >ParallelWorkerMain</text>
</g>
<g >
<title>get_futex_key (435,364,263 samples, 0.05%)</title><rect x="48.8" y="661" width="0.6" height="15.0" fill="rgb(252,216,51)" rx="2" ry="2" />
<text x="51.82" y="671.5" ></text>
</g>
<g >
<title>pg_atomic_sub_fetch_u32_impl (182,633,630 samples, 0.02%)</title><rect x="321.2" y="421" width="0.3" height="15.0" fill="rgb(225,94,22)" rx="2" ry="2" />
<text x="324.22" y="431.5" ></text>
</g>
<g >
<title>tick_sched_handle (120,082,497 samples, 0.01%)</title><rect x="933.6" y="405" width="0.2" height="15.0" fill="rgb(219,68,16)" rx="2" ry="2" />
<text x="936.59" y="415.5" ></text>
</g>
<g >
<title>__mem_cgroup_commit_charge (817,128,077 samples, 0.10%)</title><rect x="97.2" y="133" width="1.1" height="15.0" fill="rgb(212,32,7)" rx="2" ry="2" />
<text x="100.16" y="143.5" ></text>
</g>
<g >
<title>hrtimer_interrupt (128,189,296 samples, 0.02%)</title><rect x="990.7" y="437" width="0.2" height="15.0" fill="rgb(228,109,26)" rx="2" ry="2" />
<text x="993.75" y="447.5" ></text>
</g>
<g >
<title>pg_atomic_compare_exchange_u32_impl (166,198,298 samples, 0.02%)</title><rect x="1143.3" y="485" width="0.2" height="15.0" fill="rgb(235,141,33)" rx="2" ry="2" />
<text x="1146.30" y="495.5" ></text>
</g>
<g >
<title>retint_userspace_restore_args (155,169,508 samples, 0.02%)</title><rect x="57.7" y="389" width="0.2" height="15.0" fill="rgb(215,46,11)" rx="2" ry="2" />
<text x="60.66" y="399.5" ></text>
</g>
<g >
<title>radix_tree_descend (190,671,547 samples, 0.02%)</title><rect x="349.4" y="277" width="0.2" height="15.0" fill="rgb(243,175,41)" rx="2" ry="2" />
<text x="352.37" y="287.5" ></text>
</g>
<g >
<title>spin_delay (1,619,860,846 samples, 0.20%)</title><rect x="313.3" y="405" width="2.3" height="15.0" fill="rgb(240,162,38)" rx="2" ry="2" />
<text x="316.32" y="415.5" ></text>
</g>
<g >
<title>page_verify_redirects (7,203,449,051 samples, 0.87%)</title><rect x="854.5" y="501" width="10.3" height="15.0" fill="rgb(214,43,10)" rx="2" ry="2" />
<text x="857.49" y="511.5" ></text>
</g>
<g >
<title>mdreadv (319,415,208 samples, 0.04%)</title><rect x="49.5" y="757" width="0.5" height="15.0" fill="rgb(239,159,38)" rx="2" ry="2" />
<text x="52.51" y="767.5" ></text>
</g>
<g >
<title>lazy_scan_new_or_empty (140,622,587 samples, 0.02%)</title><rect x="621.8" y="549" width="0.2" height="15.0" fill="rgb(248,201,48)" rx="2" ry="2" />
<text x="624.77" y="559.5" ></text>
</g>
<g >
<title>handle_mm_fault (30,313,051,968 samples, 3.68%)</title><rect x="85.7" y="245" width="43.4" height="15.0" fill="rgb(234,135,32)" rx="2" ry="2" />
<text x="88.67" y="255.5" >hand..</text>
</g>
<g >
<title>__hrtimer_run_queues (73,285,622 samples, 0.01%)</title><rect x="96.9" y="101" width="0.1" height="15.0" fill="rgb(237,150,35)" rx="2" ry="2" />
<text x="99.88" y="111.5" ></text>
</g>
<g >
<title>MarkBufferDirtyHint (2,113,199,118 samples, 0.26%)</title><rect x="226.2" y="453" width="3.0" height="15.0" fill="rgb(234,136,32)" rx="2" ry="2" />
<text x="229.16" y="463.5" ></text>
</g>
<g >
<title>wake_up_q (155,771,975 samples, 0.02%)</title><rect x="596.8" y="277" width="0.3" height="15.0" fill="rgb(237,151,36)" rx="2" ry="2" />
<text x="599.83" y="287.5" ></text>
</g>
<g >
<title>ReadBuffer_common (7,888,684,788 samples, 0.96%)</title><rect x="11.1" y="629" width="11.3" height="15.0" fill="rgb(213,40,9)" rx="2" ry="2" />
<text x="14.14" y="639.5" ></text>
</g>
<g >
<title>xfs_file_aio_read (797,197,097 samples, 0.10%)</title><rect x="21.2" y="469" width="1.2" height="15.0" fill="rgb(224,90,21)" rx="2" ry="2" />
<text x="24.22" y="479.5" ></text>
</g>
<g >
<title>s_lock (28,144,310,941 samples, 3.41%)</title><rect x="277.5" y="421" width="40.3" height="15.0" fill="rgb(211,29,7)" rx="2" ry="2" />
<text x="280.52" y="431.5" >s_l..</text>
</g>
<g >
<title>vm_extend (93,639,631 samples, 0.01%)</title><rect x="1145.3" y="517" width="0.1" height="15.0" fill="rgb(247,195,46)" rx="2" ry="2" />
<text x="1148.27" y="527.5" ></text>
</g>
<g >
<title>__hrtimer_run_queues (80,712,318 samples, 0.01%)</title><rect x="301.0" y="341" width="0.1" height="15.0" fill="rgb(237,150,35)" rx="2" ry="2" />
<text x="304.00" y="351.5" ></text>
</g>
<g >
<title>visibilitymap_set (5,583,768,820 samples, 0.68%)</title><rect x="1135.8" y="533" width="8.0" height="15.0" fill="rgb(220,73,17)" rx="2" ry="2" />
<text x="1138.78" y="543.5" ></text>
</g>
<g >
<title>FileReadV (142,578,174 samples, 0.02%)</title><rect x="51.7" y="341" width="0.2" height="15.0" fill="rgb(222,81,19)" rx="2" ry="2" />
<text x="54.69" y="351.5" ></text>
</g>
<g >
<title>do_read_fault.isra.63 (248,307,353 samples, 0.03%)</title><rect x="57.3" y="325" width="0.4" height="15.0" fill="rgb(216,52,12)" rx="2" ry="2" />
<text x="60.31" y="335.5" ></text>
</g>
<g >
<title>__libc_pread64 (104,848,999 samples, 0.01%)</title><rect x="334.4" y="453" width="0.2" height="15.0" fill="rgb(254,226,54)" rx="2" ry="2" />
<text x="337.42" y="463.5" ></text>
</g>
<g >
<title>pg_atomic_sub_fetch_u32_impl (131,575,020 samples, 0.02%)</title><rect x="1142.6" y="469" width="0.2" height="15.0" fill="rgb(225,94,22)" rx="2" ry="2" />
<text x="1145.60" y="479.5" ></text>
</g>
<g >
<title>do_shared_fault.isra.64 (136,590,186,189 samples, 16.57%)</title><rect x="389.8" y="245" width="195.6" height="15.0" fill="rgb(245,185,44)" rx="2" ry="2" />
<text x="392.82" y="255.5" >do_shared_fault.isra.64</text>
</g>
<g >
<title>heap_prune_record_unused (136,873,054 samples, 0.02%)</title><rect x="1060.8" y="501" width="0.2" height="15.0" fill="rgb(227,105,25)" rx="2" ry="2" />
<text x="1063.84" y="511.5" ></text>
</g>
<g >
<title>free_page_and_swap_cache (517,871,509 samples, 0.06%)</title><rect x="1164.6" y="565" width="0.7" height="15.0" fill="rgb(215,49,11)" rx="2" ry="2" />
<text x="1167.58" y="575.5" ></text>
</g>
<g >
<title>call_string_check_hook (71,216,627 samples, 0.01%)</title><rect x="55.0" y="581" width="0.1" height="15.0" fill="rgb(236,144,34)" rx="2" ry="2" />
<text x="58.02" y="591.5" ></text>
</g>
<g >
<title>__do_fault.isra.61 (130,815,869,069 samples, 15.87%)</title><rect x="390.6" y="229" width="187.3" height="15.0" fill="rgb(227,102,24)" rx="2" ry="2" />
<text x="393.65" y="239.5" >__do_fault.isra.61</text>
</g>
<g >
<title>DataChecksumsEnabled (70,930,903 samples, 0.01%)</title><rect x="50.3" y="277" width="0.1" height="15.0" fill="rgb(226,96,23)" rx="2" ry="2" />
<text x="53.30" y="287.5" ></text>
</g>
<g >
<title>ReadBufferExtended (187,539,598 samples, 0.02%)</title><rect x="22.8" y="597" width="0.2" height="15.0" fill="rgb(242,171,40)" rx="2" ry="2" />
<text x="25.75" y="607.5" ></text>
</g>
<g >
<title>up_read (342,688,447 samples, 0.04%)</title><rect x="131.3" y="309" width="0.5" height="15.0" fill="rgb(209,18,4)" rx="2" ry="2" />
<text x="134.34" y="319.5" ></text>
</g>
<g >
<title>LWLockAcquire (160,900,720 samples, 0.02%)</title><rect x="73.5" y="437" width="0.2" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="76.49" y="447.5" ></text>
</g>
<g >
<title>rwsem_wake (174,896,789 samples, 0.02%)</title><rect x="596.8" y="293" width="0.3" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="599.81" y="303.5" ></text>
</g>
<g >
<title>sem_post@@GLIBC_2.2.5 (131,267,901 samples, 0.02%)</title><rect x="320.8" y="421" width="0.2" height="15.0" fill="rgb(214,41,9)" rx="2" ry="2" />
<text x="323.78" y="431.5" ></text>
</g>
<g >
<title>pg_atomic_compare_exchange_u32 (122,520,356 samples, 0.01%)</title><rect x="235.4" y="485" width="0.2" height="15.0" fill="rgb(253,220,52)" rx="2" ry="2" />
<text x="238.45" y="495.5" ></text>
</g>
<g >
<title>ReleaseBuffer (76,446,006 samples, 0.01%)</title><rect x="133.3" y="501" width="0.1" height="15.0" fill="rgb(220,71,17)" rx="2" ry="2" />
<text x="136.29" y="511.5" ></text>
</g>
<g >
<title>HeapTupleSatisfiesVacuumHorizon (1,463,672,283 samples, 0.18%)</title><rect x="40.5" y="597" width="2.1" height="15.0" fill="rgb(207,13,3)" rx="2" ry="2" />
<text x="43.51" y="607.5" ></text>
</g>
<g >
<title>PageIsVerifiedExtended (77,882,144 samples, 0.01%)</title><rect x="74.8" y="485" width="0.2" height="15.0" fill="rgb(251,215,51)" rx="2" ry="2" />
<text x="77.84" y="495.5" ></text>
</g>
<g >
<title>do_lazy_scan_heap (636,588,717,219 samples, 77.24%)</title><rect x="234.0" y="565" width="911.4" height="15.0" fill="rgb(221,75,18)" rx="2" ry="2" />
<text x="237.04" y="575.5" >do_lazy_scan_heap</text>
</g>
<g >
<title>xfs_file_buffered_aio_write (597,755,680 samples, 0.07%)</title><rect x="50.6" y="133" width="0.9" height="15.0" fill="rgb(243,176,42)" rx="2" ry="2" />
<text x="53.62" y="143.5" ></text>
</g>
<g >
<title>shmem_fault (2,001,318,628 samples, 0.24%)</title><rect x="268.6" y="325" width="2.8" height="15.0" fill="rgb(236,143,34)" rx="2" ry="2" />
<text x="271.57" y="335.5" ></text>
</g>
<g >
<title>mark_buffer_dirty (215,333,607 samples, 0.03%)</title><rect x="14.1" y="245" width="0.3" height="15.0" fill="rgb(240,163,39)" rx="2" ry="2" />
<text x="17.12" y="255.5" ></text>
</g>
<g >
<title>run_rebalance_domains (203,779,720 samples, 0.02%)</title><rect x="1189.6" y="533" width="0.3" height="15.0" fill="rgb(232,126,30)" rx="2" ry="2" />
<text x="1192.63" y="543.5" ></text>
</g>
<g >
<title>__hrtimer_run_queues (159,713,281 samples, 0.02%)</title><rect x="828.8" y="421" width="0.3" height="15.0" fill="rgb(237,150,35)" rx="2" ry="2" />
<text x="831.83" y="431.5" ></text>
</g>
<g >
<title>tag_hash (91,291,506 samples, 0.01%)</title><rect x="615.2" y="373" width="0.2" height="15.0" fill="rgb(245,185,44)" rx="2" ry="2" />
<text x="618.25" y="383.5" ></text>
</g>
<g >
<title>__schedule (588,202,006 samples, 0.07%)</title><rect x="16.4" y="261" width="0.8" height="15.0" fill="rgb(227,103,24)" rx="2" ry="2" />
<text x="19.35" y="271.5" ></text>
</g>
<g >
<title>BlockIdSet (134,762,803 samples, 0.02%)</title><rect x="160.9" y="501" width="0.2" height="15.0" fill="rgb(236,143,34)" rx="2" ry="2" />
<text x="163.89" y="511.5" ></text>
</g>
<g >
<title>update_process_times (100,961,491 samples, 0.01%)</title><rect x="1034.0" y="357" width="0.1" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="1036.96" y="367.5" ></text>
</g>
<g >
<title>do_read_fault.isra.63 (854,347,269 samples, 0.10%)</title><rect x="59.6" y="341" width="1.2" height="15.0" fill="rgb(216,52,12)" rx="2" ry="2" />
<text x="62.62" y="351.5" ></text>
</g>
<g >
<title>__alloc_pages_nodemask (3,905,384,102 samples, 0.47%)</title><rect x="570.0" y="149" width="5.6" height="15.0" fill="rgb(228,108,25)" rx="2" ry="2" />
<text x="573.05" y="159.5" ></text>
</g>
<g >
<title>dput (207,161,830 samples, 0.03%)</title><rect x="336.6" y="389" width="0.3" height="15.0" fill="rgb(238,155,37)" rx="2" ry="2" />
<text x="339.60" y="399.5" ></text>
</g>
<g >
<title>WaitReadBuffersCanStartIO (146,203,233 samples, 0.02%)</title><rect x="76.2" y="485" width="0.2" height="15.0" fill="rgb(210,27,6)" rx="2" ry="2" />
<text x="79.16" y="495.5" ></text>
</g>
<g >
<title>PinBufferForBlock (225,838,527 samples, 0.03%)</title><rect x="23.4" y="517" width="0.3" height="15.0" fill="rgb(241,168,40)" rx="2" ry="2" />
<text x="26.39" y="527.5" ></text>
</g>
<g >
<title>system_call_fastpath (109,531,290 samples, 0.01%)</title><rect x="73.9" y="389" width="0.2" height="15.0" fill="rgb(252,217,52)" rx="2" ry="2" />
<text x="76.93" y="399.5" ></text>
</g>
<g >
<title>__inc_zone_page_state (81,189,101 samples, 0.01%)</title><rect x="99.3" y="149" width="0.1" height="15.0" fill="rgb(209,22,5)" rx="2" ry="2" />
<text x="102.31" y="159.5" ></text>
</g>
<g >
<title>GetPrivateRefCount (132,829,552 samples, 0.02%)</title><rect x="1144.6" y="517" width="0.2" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="1147.61" y="527.5" ></text>
</g>
<g >
<title>sys_nanosleep (94,502,853 samples, 0.01%)</title><rect x="71.8" y="341" width="0.2" height="15.0" fill="rgb(248,200,48)" rx="2" ry="2" />
<text x="74.83" y="351.5" ></text>
</g>
<g >
<title>__switch_to (352,258,769 samples, 0.04%)</title><rect x="1161.1" y="773" width="0.5" height="15.0" fill="rgb(205,2,0)" rx="2" ry="2" />
<text x="1164.10" y="783.5" ></text>
</g>
<g >
<title>PageGetItemId (3,037,372,577 samples, 0.37%)</title><rect x="697.5" y="517" width="4.3" height="15.0" fill="rgb(246,192,46)" rx="2" ry="2" />
<text x="700.48" y="527.5" ></text>
</g>
<g >
<title>iomap_write_actor (1,258,576,371 samples, 0.15%)</title><rect x="12.8" y="309" width="1.8" height="15.0" fill="rgb(232,125,30)" rx="2" ry="2" />
<text x="15.77" y="319.5" ></text>
</g>
<g >
<title>PageGetItemId (579,037,268 samples, 0.07%)</title><rect x="149.1" y="501" width="0.8" height="15.0" fill="rgb(246,192,46)" rx="2" ry="2" />
<text x="152.12" y="511.5" ></text>
</g>
<g >
<title>__find_get_page (620,921,777 samples, 0.08%)</title><rect x="79.2" y="293" width="0.9" height="15.0" fill="rgb(229,114,27)" rx="2" ry="2" />
<text x="82.22" y="303.5" ></text>
</g>
<g >
<title>BufferAlloc (59,632,114,198 samples, 7.24%)</title><rect x="237.6" y="469" width="85.4" height="15.0" fill="rgb(252,220,52)" rx="2" ry="2" />
<text x="240.63" y="479.5" >BufferAlloc</text>
</g>
<g >
<title>MarkBufferDirty (139,688,583 samples, 0.02%)</title><rect x="137.0" y="517" width="0.2" height="15.0" fill="rgb(238,152,36)" rx="2" ry="2" />
<text x="139.98" y="527.5" ></text>
</g>
<g >
<title>BufferIsValid (92,825,077 samples, 0.01%)</title><rect x="224.7" y="421" width="0.2" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="227.75" y="431.5" ></text>
</g>
<g >
<title>pg_atomic_compare_exchange_u32_impl (95,971,567 samples, 0.01%)</title><rect x="617.1" y="341" width="0.1" height="15.0" fill="rgb(235,141,33)" rx="2" ry="2" />
<text x="620.06" y="351.5" ></text>
</g>
<g >
<title>BufferGetBlock (89,798,587 samples, 0.01%)</title><rect x="667.6" y="501" width="0.2" height="15.0" fill="rgb(242,172,41)" rx="2" ry="2" />
<text x="670.63" y="511.5" ></text>
</g>
<g >
<title>GetPrivateRefCountEntry (79,157,899 samples, 0.01%)</title><rect x="619.8" y="373" width="0.1" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="622.83" y="383.5" ></text>
</g>
<g >
<title>table_parallel_vacuum_scan (637,236,586,138 samples, 77.31%)</title><rect x="233.7" y="597" width="912.3" height="15.0" fill="rgb(240,165,39)" rx="2" ry="2" />
<text x="236.67" y="607.5" >table_parallel_vacuum_scan</text>
</g>
<g >
<title>get_page_from_freelist (2,979,438,510 samples, 0.36%)</title><rect x="570.9" y="133" width="4.3" height="15.0" fill="rgb(252,218,52)" rx="2" ry="2" />
<text x="573.92" y="143.5" ></text>
</g>
<g >
<title>pg_atomic_compare_exchange_u32_impl (417,280,752 samples, 0.05%)</title><rect x="1140.2" y="453" width="0.6" height="15.0" fill="rgb(235,141,33)" rx="2" ry="2" />
<text x="1143.23" y="463.5" ></text>
</g>
<g >
<title>iomap_apply (157,026,362 samples, 0.02%)</title><rect x="50.6" y="101" width="0.3" height="15.0" fill="rgb(247,194,46)" rx="2" ry="2" />
<text x="53.63" y="111.5" ></text>
</g>
<g >
<title>__set_page_dirty (194,182,657 samples, 0.02%)</title><rect x="14.1" y="229" width="0.3" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="17.15" y="239.5" ></text>
</g>
<g >
<title>selinux_file_permission (145,028,512 samples, 0.02%)</title><rect x="599.4" y="373" width="0.2" height="15.0" fill="rgb(249,204,48)" rx="2" ry="2" />
<text x="602.35" y="383.5" ></text>
</g>
<g >
<title>ReadBuffer_common (53,497,394,339 samples, 6.49%)</title><rect x="55.8" y="517" width="76.6" height="15.0" fill="rgb(213,40,9)" rx="2" ry="2" />
<text x="58.78" y="527.5" >ReadBuff..</text>
</g>
<g >
<title>radix_tree_descend (251,125,417 samples, 0.03%)</title><rect x="99.9" y="117" width="0.3" height="15.0" fill="rgb(243,175,41)" rx="2" ry="2" />
<text x="102.89" y="127.5" ></text>
</g>
<g >
<title>set_page_dirty (250,383,969 samples, 0.03%)</title><rect x="128.6" y="213" width="0.4" height="15.0" fill="rgb(231,123,29)" rx="2" ry="2" />
<text x="131.60" y="223.5" ></text>
</g>
<g >
<title>ResourceOwnerForgetBuffer (164,220,958 samples, 0.02%)</title><rect x="605.5" y="501" width="0.2" height="15.0" fill="rgb(247,193,46)" rx="2" ry="2" />
<text x="608.49" y="511.5" ></text>
</g>
<g >
<title>LWLockAttemptLock (1,336,748,520 samples, 0.16%)</title><rect x="601.9" y="485" width="2.0" height="15.0" fill="rgb(235,138,33)" rx="2" ry="2" />
<text x="604.95" y="495.5" ></text>
</g>
<g >
<title>lazy_scan_prune (13,601,458,906 samples, 1.65%)</title><rect x="23.8" y="645" width="19.5" height="15.0" fill="rgb(243,178,42)" rx="2" ry="2" />
<text x="26.78" y="655.5" ></text>
</g>
<g >
<title>PinBufferForBlock (398,251,233 samples, 0.05%)</title><rect x="43.4" y="741" width="0.5" height="15.0" fill="rgb(241,168,40)" rx="2" ry="2" />
<text x="46.36" y="751.5" ></text>
</g>
<g >
<title>block_write_end (301,601,796 samples, 0.04%)</title><rect x="14.0" y="277" width="0.4" height="15.0" fill="rgb(213,38,9)" rx="2" ry="2" />
<text x="17.00" y="287.5" ></text>
</g>
<g >
<title>xfs_file_buffered_aio_write (4,576,999,294 samples, 0.56%)</title><rect x="12.7" y="357" width="6.6" height="15.0" fill="rgb(243,176,42)" rx="2" ry="2" />
<text x="15.72" y="367.5" ></text>
</g>
<g >
<title>scheduler_tick (74,180,775 samples, 0.01%)</title><rect x="621.3" y="341" width="0.1" height="15.0" fill="rgb(246,190,45)" rx="2" ry="2" />
<text x="624.34" y="351.5" ></text>
</g>
<g >
<title>GetPrivateRefCount (116,993,009 samples, 0.01%)</title><rect x="625.4" y="517" width="0.1" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="628.35" y="527.5" ></text>
</g>
<g >
<title>exit_mmap (10,483,653,966 samples, 1.27%)</title><rect x="1146.0" y="693" width="15.0" height="15.0" fill="rgb(236,143,34)" rx="2" ry="2" />
<text x="1148.99" y="703.5" ></text>
</g>
<g >
<title>heap_vac_scan_next_block (1,797,995,109 samples, 0.22%)</title><rect x="133.2" y="533" width="2.6" height="15.0" fill="rgb(220,70,16)" rx="2" ry="2" />
<text x="136.23" y="543.5" ></text>
</g>
<g >
<title>ConditionVariableBroadcast (642,734,508 samples, 0.08%)</title><rect x="75.0" y="469" width="0.9" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="77.99" y="479.5" ></text>
</g>
<g >
<title>shmem_fault (28,372,881,604 samples, 3.44%)</title><rect x="87.0" y="197" width="40.7" height="15.0" fill="rgb(236,143,34)" rx="2" ry="2" />
<text x="90.04" y="207.5" >shm..</text>
</g>
<g >
<title>sys_pwrite64 (707,254,508 samples, 0.09%)</title><rect x="50.5" y="197" width="1.0" height="15.0" fill="rgb(238,156,37)" rx="2" ry="2" />
<text x="53.47" y="207.5" ></text>
</g>
<g >
<title>LWLockWakeup (157,903,922 samples, 0.02%)</title><rect x="320.7" y="437" width="0.3" height="15.0" fill="rgb(210,24,5)" rx="2" ry="2" />
<text x="323.74" y="447.5" ></text>
</g>
<g >
<title>shmem_add_to_page_cache.isra.26 (18,810,627,991 samples, 2.28%)</title><rect x="98.6" y="165" width="27.0" height="15.0" fill="rgb(250,207,49)" rx="2" ry="2" />
<text x="101.63" y="175.5" >s..</text>
</g>
<g >
<title>TransactionIdPrecedes (4,804,527,150 samples, 0.58%)</title><rect x="1048.8" y="453" width="6.9" height="15.0" fill="rgb(226,98,23)" rx="2" ry="2" />
<text x="1051.83" y="463.5" ></text>
</g>
<g >
<title>PageGetItemId (632,391,766 samples, 0.08%)</title><rect x="180.4" y="469" width="0.9" height="15.0" fill="rgb(246,192,46)" rx="2" ry="2" />
<text x="183.39" y="479.5" ></text>
</g>
<g >
<title>file_update_time (643,537,680 samples, 0.08%)</title><rect x="14.9" y="325" width="0.9" height="15.0" fill="rgb(210,27,6)" rx="2" ry="2" />
<text x="17.85" y="335.5" ></text>
</g>
<g >
<title>TransactionIdDidCommit (73,967,477 samples, 0.01%)</title><rect x="230.6" y="485" width="0.1" height="15.0" fill="rgb(216,51,12)" rx="2" ry="2" />
<text x="233.57" y="495.5" ></text>
</g>
<g >
<title>xfs_file_buffered_aio_read (781,770,891 samples, 0.09%)</title><rect x="21.2" y="453" width="1.2" height="15.0" fill="rgb(217,55,13)" rx="2" ry="2" />
<text x="24.25" y="463.5" ></text>
</g>
<g >
<title>__list_add (174,328,507 samples, 0.02%)</title><rect x="572.2" y="117" width="0.3" height="15.0" fill="rgb(235,141,33)" rx="2" ry="2" />
<text x="575.23" y="127.5" ></text>
</g>
<g >
<title>ReadBuffer_common (185,939,462 samples, 0.02%)</title><rect x="22.8" y="581" width="0.2" height="15.0" fill="rgb(213,40,9)" rx="2" ry="2" />
<text x="25.75" y="591.5" ></text>
</g>
<g >
<title>clockevents_program_event (209,766,194 samples, 0.03%)</title><rect x="1187.8" y="661" width="0.3" height="15.0" fill="rgb(244,182,43)" rx="2" ry="2" />
<text x="1190.77" y="671.5" ></text>
</g>
<g >
<title>zone_statistics (165,101,474 samples, 0.02%)</title><rect x="575.0" y="117" width="0.2" height="15.0" fill="rgb(232,125,29)" rx="2" ry="2" />
<text x="577.95" y="127.5" ></text>
</g>
<g >
<title>system_call_fastpath (314,355,073 samples, 0.04%)</title><rect x="10.4" y="645" width="0.5" height="15.0" fill="rgb(252,217,52)" rx="2" ry="2" />
<text x="13.42" y="655.5" ></text>
</g>
<g >
<title>pg_atomic_fetch_or_u32 (88,573,341 samples, 0.01%)</title><rect x="276.3" y="421" width="0.1" height="15.0" fill="rgb(221,74,17)" rx="2" ry="2" />
<text x="279.30" y="431.5" ></text>
</g>
<g >
<title>release_pages (223,385,334 samples, 0.03%)</title><rect x="440.4" y="149" width="0.4" height="15.0" fill="rgb(228,106,25)" rx="2" ry="2" />
<text x="443.45" y="159.5" ></text>
</g>
<g >
<title>ss_report_location (222,040,834 samples, 0.03%)</title><rect x="609.6" y="501" width="0.3" height="15.0" fill="rgb(249,202,48)" rx="2" ry="2" />
<text x="612.57" y="511.5" ></text>
</g>
<g >
<title>hash_search_with_hash_value (2,211,397,386 samples, 0.27%)</title><rect x="58.3" y="421" width="3.2" height="15.0" fill="rgb(249,205,49)" rx="2" ry="2" />
<text x="61.30" y="431.5" ></text>
</g>
<g >
<title>set_page_dirty (819,418,366 samples, 0.10%)</title><rect x="1157.4" y="629" width="1.2" height="15.0" fill="rgb(231,123,29)" rx="2" ry="2" />
<text x="1160.45" y="639.5" ></text>
</g>
<g >
<title>__pwrite_nocancel (4,885,135,977 samples, 0.59%)</title><rect x="12.3" y="453" width="7.0" height="15.0" fill="rgb(219,67,16)" rx="2" ry="2" />
<text x="15.34" y="463.5" ></text>
</g>
<g >
<title>PageGetItemId (4,430,694,609 samples, 0.54%)</title><rect x="777.9" y="517" width="6.3" height="15.0" fill="rgb(246,192,46)" rx="2" ry="2" />
<text x="780.88" y="527.5" ></text>
</g>
<g >
<title>BufTableHashCode (1,411,021,184 samples, 0.17%)</title><rect x="613.4" y="389" width="2.0" height="15.0" fill="rgb(215,47,11)" rx="2" ry="2" />
<text x="616.36" y="399.5" ></text>
</g>
<g >
<title>heap_prune_satisfies_vacuum (184,225,118 samples, 0.02%)</title><rect x="54.3" y="389" width="0.3" height="15.0" fill="rgb(252,219,52)" rx="2" ry="2" />
<text x="57.29" y="399.5" ></text>
</g>
<g >
<title>GetPrivateRefCountEntry (266,032,186 samples, 0.03%)</title><rect x="224.9" y="421" width="0.4" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="227.88" y="431.5" ></text>
</g>
<g >
<title>process_pm_pmsignal (637,265,816,681 samples, 77.32%)</title><rect x="233.6" y="725" width="912.4" height="15.0" fill="rgb(254,228,54)" rx="2" ry="2" />
<text x="236.63" y="735.5" >process_pm_pmsignal</text>
</g>
<g >
<title>htsv_get_valid_status (1,297,094,185 samples, 0.16%)</title><rect x="1061.0" y="501" width="1.9" height="15.0" fill="rgb(251,212,50)" rx="2" ry="2" />
<text x="1064.04" y="511.5" ></text>
</g>
<g >
<title>get_hash_value (1,265,120,321 samples, 0.15%)</title><rect x="613.4" y="373" width="1.8" height="15.0" fill="rgb(211,27,6)" rx="2" ry="2" />
<text x="616.44" y="383.5" ></text>
</g>
<g >
<title>scheduler_tick (90,492,093 samples, 0.01%)</title><rect x="393.6" y="69" width="0.2" height="15.0" fill="rgb(246,190,45)" rx="2" ry="2" />
<text x="396.65" y="79.5" ></text>
</g>
<g >
<title>auditsys (81,869,134 samples, 0.01%)</title><rect x="309.9" y="373" width="0.1" height="15.0" fill="rgb(240,161,38)" rx="2" ry="2" />
<text x="312.85" y="383.5" ></text>
</g>
<g >
<title>sys_pread64 (129,399,723 samples, 0.02%)</title><rect x="51.7" y="293" width="0.2" height="15.0" fill="rgb(212,35,8)" rx="2" ry="2" />
<text x="54.71" y="303.5" ></text>
</g>
<g >
<title>local_apic_timer_interrupt (131,208,502 samples, 0.02%)</title><rect x="682.4" y="469" width="0.1" height="15.0" fill="rgb(213,37,9)" rx="2" ry="2" />
<text x="685.35" y="479.5" ></text>
</g>
<g >
<title>pgstat_count_io_op_n (431,293,237 samples, 0.05%)</title><rect x="332.1" y="485" width="0.6" height="15.0" fill="rgb(232,128,30)" rx="2" ry="2" />
<text x="335.11" y="495.5" ></text>
</g>
<g >
<title>set_next_entity (71,827,163 samples, 0.01%)</title><rect x="1182.6" y="677" width="0.1" height="15.0" fill="rgb(232,125,29)" rx="2" ry="2" />
<text x="1185.61" y="687.5" ></text>
</g>
<g >
<title>file_update_time (456,508,965 samples, 0.06%)</title><rect x="127.9" y="213" width="0.7" height="15.0" fill="rgb(210,27,6)" rx="2" ry="2" />
<text x="130.92" y="223.5" ></text>
</g>
<g >
<title>hrtimer_interrupt (200,799,339 samples, 0.02%)</title><rect x="828.8" y="437" width="0.3" height="15.0" fill="rgb(228,109,26)" rx="2" ry="2" />
<text x="831.83" y="447.5" ></text>
</g>
<g >
<title>mmput (10,483,653,966 samples, 1.27%)</title><rect x="1146.0" y="709" width="15.0" height="15.0" fill="rgb(226,99,23)" rx="2" ry="2" />
<text x="1148.99" y="719.5" ></text>
</g>
<g >
<title>__hrtimer_run_queues (315,709,195 samples, 0.04%)</title><rect x="445.1" y="117" width="0.4" height="15.0" fill="rgb(237,150,35)" rx="2" ry="2" />
<text x="448.07" y="127.5" ></text>
</g>
<g >
<title>do_shared_fault.isra.64 (29,497,082,559 samples, 3.58%)</title><rect x="86.8" y="229" width="42.2" height="15.0" fill="rgb(245,185,44)" rx="2" ry="2" />
<text x="89.81" y="239.5" >do_..</text>
</g>
<g >
<title>page_fault (328,458,750 samples, 0.04%)</title><rect x="57.2" y="389" width="0.5" height="15.0" fill="rgb(243,177,42)" rx="2" ry="2" />
<text x="60.19" y="399.5" ></text>
</g>
<g >
<title>hrtimer_interrupt (129,480,975 samples, 0.02%)</title><rect x="682.4" y="453" width="0.1" height="15.0" fill="rgb(228,109,26)" rx="2" ry="2" />
<text x="685.35" y="463.5" ></text>
</g>
<g >
<title>generic_file_aio_read (35,237,479,854 samples, 4.28%)</title><rect x="79.0" y="325" width="50.4" height="15.0" fill="rgb(216,53,12)" rx="2" ry="2" />
<text x="81.96" y="335.5" >gener..</text>
</g>
<g >
<title>tick_sched_timer (125,874,696 samples, 0.02%)</title><rect x="682.4" y="421" width="0.1" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="685.36" y="431.5" ></text>
</g>
<g >
<title>__do_page_fault (3,447,512,278 samples, 0.42%)</title><rect x="266.9" y="389" width="4.9" height="15.0" fill="rgb(239,158,37)" rx="2" ry="2" />
<text x="269.88" y="399.5" ></text>
</g>
<g >
<title>pg_atomic_read_u32 (160,828,911 samples, 0.02%)</title><rect x="770.6" y="501" width="0.2" height="15.0" fill="rgb(248,202,48)" rx="2" ry="2" />
<text x="773.61" y="511.5" ></text>
</g>
<g >
<title>BufferAlloc (12,993,680,265 samples, 1.58%)</title><rect x="56.0" y="453" width="18.6" height="15.0" fill="rgb(252,220,52)" rx="2" ry="2" />
<text x="59.04" y="463.5" ></text>
</g>
<g >
<title>local_apic_timer_interrupt (245,327,453 samples, 0.03%)</title><rect x="933.5" y="469" width="0.3" height="15.0" fill="rgb(213,37,9)" rx="2" ry="2" />
<text x="936.47" y="479.5" ></text>
</g>
<g >
<title>__find_lock_page (135,896,440 samples, 0.02%)</title><rect x="57.4" y="261" width="0.2" height="15.0" fill="rgb(251,214,51)" rx="2" ry="2" />
<text x="60.37" y="271.5" ></text>
</g>
<g >
<title>smp_apic_timer_interrupt (206,275,440 samples, 0.03%)</title><rect x="709.2" y="501" width="0.3" height="15.0" fill="rgb(221,74,17)" rx="2" ry="2" />
<text x="712.19" y="511.5" ></text>
</g>
<g >
<title>wake_up_q (99,261,607 samples, 0.01%)</title><rect x="131.2" y="261" width="0.1" height="15.0" fill="rgb(237,151,36)" rx="2" ry="2" />
<text x="134.18" y="271.5" ></text>
</g>
<g >
<title>queued_spin_lock_slowpath (640,204,755 samples, 0.08%)</title><rect x="95.9" y="149" width="0.9" height="15.0" fill="rgb(231,122,29)" rx="2" ry="2" />
<text x="98.87" y="159.5" ></text>
</g>
<g >
<title>local_apic_timer_interrupt (200,799,339 samples, 0.02%)</title><rect x="828.8" y="453" width="0.3" height="15.0" fill="rgb(213,37,9)" rx="2" ry="2" />
<text x="831.83" y="463.5" ></text>
</g>
<g >
<title>BufferIsValid (1,132,572,914 samples, 0.14%)</title><rect x="1099.7" y="469" width="1.6" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="1102.72" y="479.5" ></text>
</g>
<g >
<title>hash_search_with_hash_value (205,061,762 samples, 0.02%)</title><rect x="11.6" y="533" width="0.3" height="15.0" fill="rgb(249,205,49)" rx="2" ry="2" />
<text x="14.60" y="543.5" ></text>
</g>
<g >
<title>BufTableHashCode (293,916,447 samples, 0.04%)</title><rect x="134.3" y="373" width="0.4" height="15.0" fill="rgb(215,47,11)" rx="2" ry="2" />
<text x="137.27" y="383.5" ></text>
</g>
<g >
<title>get_next_timer_interrupt (76,093,774 samples, 0.01%)</title><rect x="1184.9" y="677" width="0.1" height="15.0" fill="rgb(254,229,54)" rx="2" ry="2" />
<text x="1187.90" y="687.5" ></text>
</g>
<g >
<title>vfs_write (4,721,878,079 samples, 0.57%)</title><rect x="12.6" y="405" width="6.7" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="15.57" y="415.5" ></text>
</g>
<g >
<title>TransactionIdPrecedes (4,099,787,855 samples, 0.50%)</title><rect x="984.9" y="485" width="5.8" height="15.0" fill="rgb(226,98,23)" rx="2" ry="2" />
<text x="987.86" y="495.5" ></text>
</g>
<g >
<title>ItemPointerSet (231,346,316 samples, 0.03%)</title><rect x="28.8" y="613" width="0.3" height="15.0" fill="rgb(237,147,35)" rx="2" ry="2" />
<text x="31.76" y="623.5" ></text>
</g>
<g >
<title>LWLockHeldByMe (905,157,760 samples, 0.11%)</title><rect x="1117.3" y="453" width="1.3" height="15.0" fill="rgb(252,219,52)" rx="2" ry="2" />
<text x="1120.28" y="463.5" ></text>
</g>
<g >
<title>pg_atomic_read_u32 (1,164,123,826 samples, 0.14%)</title><rect x="1097.8" y="453" width="1.6" height="15.0" fill="rgb(248,202,48)" rx="2" ry="2" />
<text x="1100.77" y="463.5" ></text>
</g>
<g >
<title>generic_file_aio_read (108,562,124 samples, 0.01%)</title><rect x="51.7" y="213" width="0.2" height="15.0" fill="rgb(216,53,12)" rx="2" ry="2" />
<text x="54.73" y="223.5" ></text>
</g>
<g >
<title>apic_timer_interrupt (81,334,197 samples, 0.01%)</title><rect x="87.4" y="181" width="0.1" height="15.0" fill="rgb(205,1,0)" rx="2" ry="2" />
<text x="90.38" y="191.5" ></text>
</g>
<g >
<title>do_sync_write (4,667,300,246 samples, 0.57%)</title><rect x="12.6" y="389" width="6.7" height="15.0" fill="rgb(213,37,9)" rx="2" ry="2" />
<text x="15.60" y="399.5" ></text>
</g>
<g >
<title>up_read (93,745,128 samples, 0.01%)</title><rect x="587.1" y="341" width="0.2" height="15.0" fill="rgb(209,18,4)" rx="2" ry="2" />
<text x="590.12" y="351.5" ></text>
</g>
<g >
<title>__schedule (98,781,414 samples, 0.01%)</title><rect x="48.6" y="645" width="0.1" height="15.0" fill="rgb(227,103,24)" rx="2" ry="2" />
<text x="51.57" y="655.5" ></text>
</g>
<g >
<title>LockBuffer (80,265,410 samples, 0.01%)</title><rect x="43.1" y="613" width="0.2" height="15.0" fill="rgb(235,142,34)" rx="2" ry="2" />
<text x="46.14" y="623.5" ></text>
</g>
<g >
<title>check_preempt_curr (192,581,581 samples, 0.02%)</title><rect x="1180.9" y="677" width="0.3" height="15.0" fill="rgb(231,122,29)" rx="2" ry="2" />
<text x="1183.94" y="687.5" ></text>
</g>
<g >
<title>copy_user_enhanced_fast_string (14,635,789,779 samples, 1.78%)</title><rect x="349.9" y="309" width="21.0" height="15.0" fill="rgb(238,155,37)" rx="2" ry="2" />
<text x="352.92" y="319.5" ></text>
</g>
<g >
<title>MarkBufferDirty (73,475,188 samples, 0.01%)</title><rect x="232.8" y="501" width="0.1" height="15.0" fill="rgb(238,152,36)" rx="2" ry="2" />
<text x="235.76" y="511.5" ></text>
</g>
<g >
<title>vm_readbuf (1,434,602,771 samples, 0.17%)</title><rect x="133.7" y="485" width="2.1" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="136.75" y="495.5" ></text>
</g>
<g >
<title>TerminateBufferIO (3,770,178,884 samples, 0.46%)</title><rect x="325.4" y="501" width="5.4" height="15.0" fill="rgb(239,160,38)" rx="2" ry="2" />
<text x="328.40" y="511.5" ></text>
</g>
<g >
<title>vacuum_delay_point (112,464,995 samples, 0.01%)</title><rect x="1144.1" y="549" width="0.2" height="15.0" fill="rgb(208,17,4)" rx="2" ry="2" />
<text x="1147.10" y="559.5" ></text>
</g>
<g >
<title>lookup_page_cgroup (218,104,246 samples, 0.03%)</title><rect x="450.7" y="133" width="0.3" height="15.0" fill="rgb(228,107,25)" rx="2" ry="2" />
<text x="453.66" y="143.5" ></text>
</g>
<g >
<title>_raw_qspin_lock_irq (17,281,462,928 samples, 2.10%)</title><rect x="100.8" y="149" width="24.8" height="15.0" fill="rgb(251,214,51)" rx="2" ry="2" />
<text x="103.81" y="159.5" >_..</text>
</g>
<g >
<title>pg_atomic_fetch_or_u32_impl (581,773,591 samples, 0.07%)</title><rect x="321.9" y="421" width="0.8" height="15.0" fill="rgb(253,224,53)" rx="2" ry="2" />
<text x="324.92" y="431.5" ></text>
</g>
<g >
<title>TransactionIdPrecedes (386,075,659 samples, 0.05%)</title><rect x="39.2" y="549" width="0.5" height="15.0" fill="rgb(226,98,23)" rx="2" ry="2" />
<text x="42.18" y="559.5" ></text>
</g>
<g >
<title>iomap_write_begin (113,094,523 samples, 0.01%)</title><rect x="10.5" y="501" width="0.2" height="15.0" fill="rgb(211,30,7)" rx="2" ry="2" />
<text x="13.54" y="511.5" ></text>
</g>
<g >
<title>perform_spin_delay (2,282,751,802 samples, 0.28%)</title><rect x="271.9" y="421" width="3.2" height="15.0" fill="rgb(247,196,46)" rx="2" ry="2" />
<text x="274.87" y="431.5" ></text>
</g>
<g >
<title>update_time (254,983,261 samples, 0.03%)</title><rect x="586.5" y="293" width="0.4" height="15.0" fill="rgb(211,31,7)" rx="2" ry="2" />
<text x="589.50" y="303.5" ></text>
</g>
<g >
<title>try_to_wake_up (128,474,798 samples, 0.02%)</title><rect x="51.3" y="37" width="0.2" height="15.0" fill="rgb(220,70,16)" rx="2" ry="2" />
<text x="54.30" y="47.5" ></text>
</g>
<g >
<title>heap_tuple_should_freeze (3,112,739,937 samples, 0.38%)</title><rect x="213.4" y="453" width="4.5" height="15.0" fill="rgb(247,194,46)" rx="2" ry="2" />
<text x="216.44" y="463.5" ></text>
</g>
<g >
<title>BufferGetPage (2,664,606,174 samples, 0.32%)</title><rect x="1106.0" y="453" width="3.9" height="15.0" fill="rgb(253,220,52)" rx="2" ry="2" />
<text x="1109.04" y="463.5" ></text>
</g>
<g >
<title>BufTableLookup (169,301,111 samples, 0.02%)</title><rect x="134.7" y="373" width="0.2" height="15.0" fill="rgb(224,89,21)" rx="2" ry="2" />
<text x="137.70" y="383.5" ></text>
</g>
<g >
<title>tick_sched_timer (110,974,429 samples, 0.01%)</title><rect x="1086.9" y="389" width="0.2" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="1089.95" y="399.5" ></text>
</g>
<g >
<title>resched_curr (73,591,642 samples, 0.01%)</title><rect x="1181.2" y="677" width="0.1" height="15.0" fill="rgb(240,161,38)" rx="2" ry="2" />
<text x="1184.22" y="687.5" ></text>
</g>
<g >
<title>local_apic_timer_interrupt (342,711,918 samples, 0.04%)</title><rect x="784.3" y="485" width="0.5" height="15.0" fill="rgb(213,37,9)" rx="2" ry="2" />
<text x="787.33" y="495.5" ></text>
</g>
<g >
<title>__write_nocancel (335,326,487 samples, 0.04%)</title><rect x="10.4" y="661" width="0.5" height="15.0" fill="rgb(243,175,42)" rx="2" ry="2" />
<text x="13.39" y="671.5" ></text>
</g>
<g >
<title>set_cpu_sd_state_idle (84,511,405 samples, 0.01%)</title><rect x="1186.6" y="709" width="0.2" height="15.0" fill="rgb(211,29,6)" rx="2" ry="2" />
<text x="1189.65" y="719.5" ></text>
</g>
<g >
<title>do_read_fault.isra.63 (1,608,472,116 samples, 0.20%)</title><rect x="244.3" y="341" width="2.3" height="15.0" fill="rgb(216,52,12)" rx="2" ry="2" />
<text x="247.31" y="351.5" ></text>
</g>
<g >
<title>BufferIsValid (243,620,002 samples, 0.03%)</title><rect x="224.0" y="437" width="0.4" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="227.03" y="447.5" ></text>
</g>
<g >
<title>TransactionIdFollows (2,590,304,802 samples, 0.31%)</title><rect x="701.9" y="517" width="3.7" height="15.0" fill="rgb(222,79,18)" rx="2" ry="2" />
<text x="704.88" y="527.5" ></text>
</g>
<g >
<title>GetVictimBuffer (920,252,455 samples, 0.11%)</title><rect x="50.3" y="325" width="1.3" height="15.0" fill="rgb(209,18,4)" rx="2" ry="2" />
<text x="53.29" y="335.5" ></text>
</g>
<g >
<title>apic_timer_interrupt (75,510,927 samples, 0.01%)</title><rect x="697.4" y="501" width="0.1" height="15.0" fill="rgb(205,1,0)" rx="2" ry="2" />
<text x="700.37" y="511.5" ></text>
</g>
<g >
<title>maybe_start_bgworkers (124,943,333,272 samples, 15.16%)</title><rect x="54.6" y="693" width="178.9" height="15.0" fill="rgb(240,161,38)" rx="2" ry="2" />
<text x="57.63" y="703.5" >maybe_start_bgworkers</text>
</g>
<g >
<title>native_queued_spin_lock_slowpath (16,850,311,132 samples, 2.04%)</title><rect x="101.4" y="117" width="24.2" height="15.0" fill="rgb(238,153,36)" rx="2" ry="2" />
<text x="104.43" y="127.5" >n..</text>
</g>
<g >
<title>heap_prune_record_unused (1,565,040,447 samples, 0.19%)</title><rect x="936.7" y="485" width="2.3" height="15.0" fill="rgb(227,105,25)" rx="2" ry="2" />
<text x="939.74" y="495.5" ></text>
</g>
<g >
<title>down_read (5,430,949,971 samples, 0.66%)</title><rect x="587.4" y="325" width="7.8" height="15.0" fill="rgb(246,188,45)" rx="2" ry="2" />
<text x="590.39" y="335.5" ></text>
</g>
<g >
<title>__radix_tree_lookup (384,526,653 samples, 0.05%)</title><rect x="60.0" y="229" width="0.6" height="15.0" fill="rgb(253,222,53)" rx="2" ry="2" />
<text x="63.05" y="239.5" ></text>
</g>
</g>
</svg>
Sorry for the very late reply.
On Tue, Jul 30, 2024 at 8:54 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Dear Sawada-san,
Thank you for testing!
I tried to profile the vacuuming with the larger case (40 workers for the 20G table)
and the attached FlameGraph shows the result. IIUC, I cannot find bottlenecks.

2.
I compared parallel heap scan and found that it does not have a compute_worker API.
Can you clarify the reason why there is an inconsistency?
(I feel it is intentional because the calculation logic seems to depend on the heap structure,
so should we add the API for table scan as well?)
There is room to consider a better API design, but yes, the reason is
that the calculation logic depends on the table AM implementation. For
example, I thought it might make sense to take the number of
all-visible pages into account when calculating the number of parallel
workers, as we don't want to launch many workers on a table where most
pages are all-visible. That might not work for other table AMs (a
rough sketch of the idea follows below).

Okay, thanks for confirming. I wanted to ask others as well.
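To make the all-visible heuristic above concrete, here is a minimal sketch of what such a table AM callback could look like. This is not the patch's code: the callback name matches the one added by the 0001 patch, but the body, the tripling rule, and the reuse of min_parallel_table_scan_size are illustrative assumptions only.

static int
heap_parallel_vacuum_compute_workers_sketch(Relation rel, int nrequested)
{
    BlockNumber total_pages = RelationGetNumberOfBlocks(rel);
    BlockNumber all_visible;
    BlockNumber all_frozen;
    BlockNumber pages_to_scan;
    int         workers = 0;

    /* Pages the scan will actually visit: everything not all-visible */
    visibilitymap_count(rel, &all_visible, &all_frozen);
    pages_to_scan = total_pages - all_visible;

    /* An explicit parallel degree request wins */
    if (nrequested > 0)
        return Min(nrequested, max_parallel_maintenance_workers);

    /* Made-up scaling rule: one more worker each time the pages triple */
    while (pages_to_scan >= (BlockNumber) min_parallel_table_scan_size * 3)
    {
        workers++;
        pages_to_scan /= 3;
    }

    return Min(workers, max_parallel_maintenance_workers);
}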
I'm updating the patch to implement parallel heap vacuum and will
share the updated patch. It might take time as it requires
implementing shared iteration support in the radix tree.

Here are other preliminary comments for the v2 patch. This does not
contain cosmetic ones.

1.
The shared data structure PHVShared does not contain a mutex lock. Is that intentional
because its fields are accessed by the leader only after the parallel workers exit?

Yes, the fields in PHVShared are read-only for workers. Since no
concurrent reads/writes happen on these fields, we don't need to
protect them.
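For illustration, a minimal sketch of why that is safe, using the struct and DSM key names from the attached 0001 patch (the exact code in the patch differs, and shared_len is a placeholder): the leader fully initializes the shared area before any worker is launched, so workers only ever read it.

    /* Leader, before launching any worker: fill the read-only fields. */
    shared = (PHVShared *) shm_toc_allocate(pcxt->toc, shared_len);
    shared->aggressive = vacrel->aggressive;
    shared->skipwithvm = vacrel->skipwithvm;
    shared->cutoffs = vacrel->cutoffs;  /* struct copy of the vacuum cutoffs */
    shm_toc_insert(pcxt->toc, LV_PARALLEL_SCAN_SHARED, shared);

    LaunchParallelWorkers(pcxt);        /* workers attach only after this */

    /* Worker side: look up and read, never write. */
    shared = (PHVShared *) shm_toc_lookup(toc, LV_PARALLEL_SCAN_SHARED, false);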
2.
Per my understanding, the vacuuming goes through the steps below:
a. parallel workers are launched for scanning pages
b. the leader waits until the scans are done
c. the leader does the vacuum alone (you may extend here...)
d. parallel workers are launched again to clean up indexes

If so, can we reuse the parallel workers for the cleanup? Or is that more painful
engineering than it is worth?
I've not thought of this idea, but I think it's possible from a
technical perspective. It saves the overhead of relaunching workers,
but I'm not sure how much it would improve performance, and I'm
concerned it would make the code complex. For example, different
numbers of workers might be required for table vacuuming and index
vacuuming, so we would end up increasing or decreasing the workers.
3.
According to LaunchParallelWorkers(), the bgw_name and bgw_type are hardcoded as
"parallel worker ...". Can we extend this to improve trackability in
pg_stat_activity?
It would be a good improvement for trackability, but I think we
should do that in a separate patch, as it's not a problem specific to
parallel heap vacuum.
4.
I'm not an expert on TidStore, but as you said, TidStoreLockExclusive() might be a
bottleneck when a TID is added to the shared TidStore. Another primitive idea
is to prepare per-worker TidStores (in the PHVScanWorkerState or LVRelCounters?)
and gather them after the heap scanning. If you extend the patch so that parallel
workers do the vacuuming, the gathering may not be needed: each worker can access
its own TidStore and clean it up. One downside is that the memory consumption may
be quite large.
Interesting idea. Suppose we supported parallel heap vacuum as well;
then we wouldn't need locks or shared-iteration support on TidStore. I
think each worker should use a fraction of maintenance_work_mem.
However, one downside would be that we need to check as many TidStores
as there are workers during index vacuuming.
FYI I've implemented the parallel heap vacuum part and am doing some
benchmark tests. I'll share the updated patches along with test
results this week.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Oct 22, 2024 at 4:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
However, one downside would be that we need to check as many TidStores as there are workers during index vacuuming.
On further thought, I don't think the per-worker TidStore idea works
well. Index vacuuming is the most time-consuming phase among the
vacuum phases, so it would not be a good idea to make it slower even
if we could do the parallel heap scan and heap vacuum without any
locking. Also, merging multiple TidStores into one is not
straightforward, since the block ranges that each worker processes
overlap.
Please find the attached patches. From the previous version, I made a
lot of changes, including bug fixes, addressing review comments, and
adding parallel heap vacuum support. The parallel vacuum related
infrastructure is implemented in vacuumparallel.c, and vacuumlazy.c
now uses ParallelVacuumState for parallel heap scan/vacuum, index
bulkdelete/cleanup, or both. Parallel vacuum workers are launched at
the beginning of each phase and exit at the end of it. Since different
numbers of workers could be used for heap scan/vacuum and index
bulkdelete/cleanup, it's possible that only one of heap scan/vacuum
and index bulkdelete/cleanup is parallelized.
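As a rough sketch of that per-phase flow, using function and flag names that appear in the attached patches (error handling and the surrounding logic are elided; this condenses, not reproduces, the patch):

    /* Phase 1: parallel heap scan -- the leader participates too. */
    phvstate->shared->do_heap_vacuum = false;
    phvstate->nworkers_launched = parallel_vacuum_table_scan_begin(vacrel->pvs);
    do_lazy_scan_heap(vacrel);
    parallel_vacuum_table_scan_end(vacrel->pvs);   /* waits for workers to exit */

    /* Phase 2: parallel index bulkdelete (existing infrastructure). */

    /* Phase 3: parallel heap vacuum, driven by a shared TidStore iteration. */
    phvstate->shared->do_heap_vacuum = true;
    phvstate->nworkers_launched = parallel_vacuum_table_scan_begin(vacrel->pvs);
    do_lazy_vacuum_heap_rel(vacrel, iter);
    parallel_vacuum_table_scan_end(vacrel->pvs);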
In order to implement parallel heap vacuum, I extended the radix tree
and TidStore to support shared iteration. A shared iteration works
only with a shared TidStore, whereas a non-shared iteration works with
a local TidStore as well as a shared one. For example, if a table is
large and has one index, we use only the parallel heap scan/vacuum. In
this case, we store dead item TIDs into a shared TidStore during the
parallel heap scan, but during index bulk-deletion we perform a
non-shared iteration on the shared TidStore, which is more efficient
as it doesn't acquire any locks during the iteration.
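To sketch how the shared-iteration API from the attached 0003 patch fits together, assuming (as the 0004 patch does) that the iterator handle travels to workers through the DSM area:

    /* Leader: begin a shared iteration and export its handle. */
    TidStoreIter *iter = TidStoreBeginIterateShared(dead_items);
    shared->shared_iter_handle = TidStoreGetSharedIterHandle(iter);
    /* ... launch workers, then consume blocks via TidStoreIterateNext(iter) ... */

    /* Worker: attach to the same iteration and consume blocks from it. */
    TidStoreIter *witer =
        TidStoreAttachIterateShared(dead_items, shared->shared_iter_handle);
    TidStoreIterResult *result;

    while ((result = TidStoreIterateNext(witer)) != NULL)
    {
        /* vacuum the LP_DEAD items recorded for this block */
    }
    TidStoreEndIterate(witer);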
I've done benchmark tests with a 10GB unlogged table (created on a
tmpfs tablespace) having 4 btree indexes, while changing the parallel
degree. I restarted the postgres server before each run to ensure that
the data is not already cached in shared memory, and I avoided disk
I/O during lazy vacuum as much as possible. Here is a comparison
between HEAD and patched, taking the median of 5 runs (times in
milliseconds; ratio = patched/HEAD, lower is better):

+----------+-----------+-----------+-------+
| parallel | HEAD      | patched   | ratio |
+----------+-----------+-----------+-------+
|        0 | 53079.530 | 53468.734 | 1.007 |
|        1 | 48101.460 | 35712.613 | 0.742 |
|        2 | 37767.902 | 23566.426 | 0.624 |
|        4 | 38005.836 | 20192.055 | 0.531 |
|        8 | 37754.470 | 18614.717 | 0.493 |
+----------+-----------+-----------+-------+
Here are the breakdowns of the execution times of each vacuum phase
(from left: heap scan, index bulkdel, heap vacuum):
- HEAD
parallel 0: 53079.530 (15886, 28039, 9270)
parallel 1: 48101.460 (15931, 23247, 9215)
parallel 2: 37767.902 (15259, 12888, 9479)
parallel 4: 38005.836 (16097, 12683, 9217)
parallel 8: 37754.470 (16016, 12535, 9306)
- Patched
parallel 0: 53468.734 (15990, 28296, 9465)
parallel 1: 35712.613 ( 8254, 23569, 3700)
parallel 2: 23566.426 ( 6180, 12760, 3283)
parallel 4: 20192.055 ( 4058, 12776, 2154)
parallel 8: 18614.717 ( 2797, 13244, 1579)
The index bulkdel phase saturates at parallel 2, as one worker is
assigned to each index. On HEAD, there is no further performance gain
beyond 'parallel 4'. On the other hand, the patched version got faster
even at 'parallel 4' and 'parallel 8', since the other two phases were
also done by parallel workers.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v3-0004-Support-parallel-heap-vacuum-during-lazy-vacuum.patch
From dd9f54e11877f7de08b084eac1701b35859e0fbc Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 24 Oct 2024 17:37:45 -0700
Subject: [PATCH v3 4/4] Support parallel heap vacuum during lazy vacuum.
This commit further extends parallel vacuum to perform the heap vacuum
phase with parallel workers. It leverages the shared TidStore iteration.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
src/backend/access/heap/vacuumlazy.c | 157 ++++++++++++++++++---------
1 file changed, 106 insertions(+), 51 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index fd6c054901..6c22ca5a62 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -160,6 +160,7 @@ typedef struct LVRelScanStats
BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
+ BlockNumber vacuumed_pages; /* # pages vacuumed in one second-pass cycle */
/* Counters that follow are only for scanned_pages */
int64 tuples_deleted; /* # deleted from table */
@@ -192,6 +193,9 @@ typedef struct PHVShared
struct VacuumCutoffs cutoffs;
GlobalVisState vistest;
+ dsa_pointer shared_iter_handle;
+ bool do_heap_vacuum;
+
/* per-worker scan stats for parallel heap vacuum scan */
LVRelScanStats worker_scan_stats[FLEXIBLE_ARRAY_MEMBER];
} PHVShared;
@@ -353,6 +357,7 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
+static void do_lazy_vacuum_heap_rel(LVRelState *vacrel, TidStoreIter *iter);
static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
Buffer buffer, OffsetNumber *deadoffsets,
int num_offsets, Buffer vmbuffer);
@@ -531,6 +536,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
scan_stats->lpdead_item_pages = 0;
scan_stats->missed_dead_pages = 0;
scan_stats->nonempty_pages = 0;
+ scan_stats->vacuumed_pages = 0;
/* Initialize remaining counters (be tidy) */
scan_stats->tuples_deleted = 0;
@@ -2363,46 +2369,14 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
return allindexes;
}
-/*
- * lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy
- *
- * This routine marks LP_DEAD items in vacrel->dead_items as LP_UNUSED. Pages
- * that never had lazy_scan_prune record LP_DEAD items are not visited at all.
- *
- * We may also be able to truncate the line pointer array of the heap pages we
- * visit. If there is a contiguous group of LP_UNUSED items at the end of the
- * array, it can be reclaimed as free space. These LP_UNUSED items usually
- * start out as LP_DEAD items recorded by lazy_scan_prune (we set items from
- * each page to LP_UNUSED, and then consider if it's possible to truncate the
- * page's line pointer array).
- *
- * Note: the reason for doing this as a second pass is we cannot remove the
- * tuples until we've removed their index entries, and we want to process
- * index entry removal in batches as large as possible.
- */
static void
-lazy_vacuum_heap_rel(LVRelState *vacrel)
+do_lazy_vacuum_heap_rel(LVRelState *vacrel, TidStoreIter *iter)
{
- BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
- LVSavedErrInfo saved_err_info;
- TidStoreIter *iter;
- TidStoreIterResult *iter_result;
-
- Assert(vacrel->do_index_vacuuming);
- Assert(vacrel->do_index_cleanup);
- Assert(vacrel->num_index_scans > 0);
- /* Report that we are now vacuuming the heap */
- pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
- PROGRESS_VACUUM_PHASE_VACUUM_HEAP);
-
- /* Update error traceback information */
- update_vacuum_error_info(vacrel, &saved_err_info,
- VACUUM_ERRCB_PHASE_VACUUM_HEAP,
- InvalidBlockNumber, InvalidOffsetNumber);
+ /* LVSavedErrInfo saved_err_info; */
+ TidStoreIterResult *iter_result;
- iter = TidStoreBeginIterate(vacrel->dead_items);
while ((iter_result = TidStoreIterateNext(iter)) != NULL)
{
BlockNumber blkno;
@@ -2440,26 +2414,88 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
UnlockReleaseBuffer(buf);
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
- vacuumed_pages++;
+ vacrel->scan_stats->vacuumed_pages++;
}
- TidStoreEndIterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
ReleaseBuffer(vmbuffer);
+}
+
+/*
+ * lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy
+ *
+ * This routine marks LP_DEAD items in vacrel->dead_items as LP_UNUSED. Pages
+ * that never had lazy_scan_prune record LP_DEAD items are not visited at all.
+ *
+ * We may also be able to truncate the line pointer array of the heap pages we
+ * visit. If there is a contiguous group of LP_UNUSED items at the end of the
+ * array, it can be reclaimed as free space. These LP_UNUSED items usually
+ * start out as LP_DEAD items recorded by lazy_scan_prune (we set items from
+ * each page to LP_UNUSED, and then consider if it's possible to truncate the
+ * page's line pointer array).
+ *
+ * Note: the reason for doing this as a second pass is we cannot remove the
+ * tuples until we've removed their index entries, and we want to process
+ * index entry removal in batches as large as possible.
+ */
+static void
+lazy_vacuum_heap_rel(LVRelState *vacrel)
+{
+ LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+
+ Assert(vacrel->do_index_vacuuming);
+ Assert(vacrel->do_index_cleanup);
+ Assert(vacrel->num_index_scans > 0);
+
+ /* Report that we are now vacuuming the heap */
+ pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+ PROGRESS_VACUUM_PHASE_VACUUM_HEAP);
+
+ /* Update error traceback information */
+ update_vacuum_error_info(vacrel, &saved_err_info,
+ VACUUM_ERRCB_PHASE_VACUUM_HEAP,
+ InvalidBlockNumber, InvalidOffsetNumber);
+
+ vacrel->scan_stats->vacuumed_pages = 0;
+
+ if (ParallelHeapVacuumIsActive(vacrel))
+ {
+ PHVState *phvstate = vacrel->phvstate;
+
+ iter = TidStoreBeginIterateShared(vacrel->dead_items);
+
+ phvstate->shared->do_heap_vacuum = true;
+ phvstate->shared->shared_iter_handle = TidStoreGetSharedIterHandle(iter);
+
+ /* launch workers */
+ vacrel->phvstate->nworkers_launched = parallel_vacuum_table_scan_begin(vacrel->pvs);
+ }
+ else
+ iter = TidStoreBeginIterate(vacrel->dead_items);
+
+ /* do the real work */
+ do_lazy_vacuum_heap_rel(vacrel, iter);
+
+ if (ParallelHeapVacuumIsActive(vacrel))
+ parallel_vacuum_table_scan_end(vacrel->pvs);
+
+ TidStoreEndIterate(iter);
+
/*
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
Assert(vacrel->num_index_scans > 1 ||
(vacrel->dead_items_info->num_items == vacrel->scan_stats->lpdead_items &&
- vacuumed_pages == vacrel->scan_stats->lpdead_item_pages));
+ vacrel->scan_stats->vacuumed_pages == vacrel->scan_stats->lpdead_item_pages));
ereport(DEBUG2,
(errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
vacrel->relname, (long long) vacrel->dead_items_info->num_items,
- vacuumed_pages)));
+ vacrel->scan_stats->vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
@@ -3563,7 +3599,6 @@ heap_parallel_vacuum_scan_worker(Relation rel, ParallelVacuumState *pvs,
PHVScanWorkerState *scanstate;
LVRelScanStats *scan_stats;
ErrorContextCallback errcallback;
- bool scan_done;
phvstate = palloc(sizeof(PHVState));
@@ -3625,25 +3660,44 @@ heap_parallel_vacuum_scan_worker(Relation rel, ParallelVacuumState *pvs,
vacrel.relnamespace = get_database_name(RelationGetNamespace(rel));
vacrel.relname = pstrdup(RelationGetRelationName(rel));
vacrel.indname = NULL;
- vacrel.phase = VACUUM_ERRCB_PHASE_SCAN_HEAP;
errcallback.callback = vacuum_error_callback;
errcallback.arg = &vacrel;
errcallback.previous = error_context_stack;
error_context_stack = &errcallback;
- scan_done = do_lazy_scan_heap(&vacrel);
+ if (shared->do_heap_vacuum)
+ {
+ TidStoreIter *iter;
+
+ iter = TidStoreAttachIterateShared(vacrel.dead_items, shared->shared_iter_handle);
+
+ /* Join parallel heap vacuum */
+ vacrel.phase = VACUUM_ERRCB_PHASE_VACUUM_HEAP;
+ do_lazy_vacuum_heap_rel(&vacrel, iter);
+
+ TidStoreEndIterate(iter);
+ }
+ else
+ {
+ bool scan_done;
+
+ /* Join parallel heap scan */
+ vacrel.phase = VACUUM_ERRCB_PHASE_SCAN_HEAP;
+ scan_done = do_lazy_scan_heap(&vacrel);
+
+ /*
+ * If the leader or a worker finishes the heap scan because the
+ * dead_items TIDs are close to the limit, it might have some allocated
+ * blocks in its scan state. Since this scan state might not be used in
+ * the next heap scan, we remember that it might have some unconsumed
+ * blocks so that the leader completes the scans after the heap scan
+ * phase finishes.
+ */
+ phvstate->myscanstate->maybe_have_blocks = !scan_done;
+ }
/* Pop the error context stack */
error_context_stack = errcallback.previous;
-
- /*
- * If the leader or a worker finishes the heap scan because dead_items
- * TIDs is close to the limit, it might have some allocated blocks in its
- * scan state. Since this scan state might not be used in the next heap
- * scan, we remember that it might have some unconsumed blocks so that the
- * leader complete the scans after the heap scan phase finishes.
- */
- phvstate->myscanstate->maybe_have_blocks = !scan_done;
}
/*
@@ -3771,6 +3825,7 @@ do_parallel_lazy_scan_heap(LVRelState *vacrel)
Assert(!IsParallelWorker());
/* launcher workers */
+ vacrel->phvstate->shared->do_heap_vacuum = false;
vacrel->phvstate->nworkers_launched = parallel_vacuum_table_scan_begin(vacrel->pvs);
/* initialize parallel scan description to join as a worker */
--
2.43.5
v3-0003-Support-shared-itereation-on-TidStore.patch
From 09b7bcd6c8e3fbc9438c6edf1aac75a55b3909be Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 24 Oct 2024 17:34:57 -0700
Subject: [PATCH v3 3/4] Support shared iteration on TidStore.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
src/backend/access/common/tidstore.c | 59 ++++++++++++++++++++++++++++
src/include/access/tidstore.h | 3 ++
2 files changed, 62 insertions(+)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index a7179759d6..637d26012d 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -483,6 +483,7 @@ TidStoreBeginIterate(TidStore *ts)
iter = palloc0(sizeof(TidStoreIter));
iter->ts = ts;
+ /* begin iteration on the radix tree */
if (TidStoreIsShared(ts))
iter->tree_iter.shared = shared_ts_begin_iterate(ts->tree.shared);
else
@@ -533,6 +534,56 @@ TidStoreEndIterate(TidStoreIter *iter)
pfree(iter);
}
+/*
+ * Prepare to iterate through a shared TidStore in shared mode. This function
+ * is intended to start the iteration on the given TidStore with parallel workers.
+ *
+ * The TidStoreIter struct is created in the caller's memory context, and it
+ * will be freed in TidStoreEndIterate.
+ *
+ * The caller is responsible for locking TidStore until the iteration is
+ * finished.
+ */
+TidStoreIter *
+TidStoreBeginIterateShared(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ if (!TidStoreIsShared(ts))
+ elog(ERROR, "cannot begin shared iteration on local TidStore");
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+
+ /* begin the shared iteration on radix tree */
+ iter->tree_iter.shared =
+ (shared_ts_iter *) shared_ts_begin_iterate_shared(ts->tree.shared);
+
+ return iter;
+}
+
+/*
+ * Attach to the shared TidStore iterator. 'iter_handle' is the dsa_pointer
+ * returned by TidStoreGetSharedIterHandle(). The returned object is allocated
+ * in backend-local memory using CurrentMemoryContext.
+ */
+TidStoreIter *
+TidStoreAttachIterateShared(TidStore *ts, dsa_pointer iter_handle)
+{
+ TidStoreIter *iter;
+
+ Assert(TidStoreIsShared(ts));
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+
+ /* Attach to the shared iterator */
+ iter->tree_iter.shared = shared_ts_attach_iterate_shared(ts->tree.shared,
+ iter_handle);
+
+ return iter;
+}
+
/*
* Return the memory usage of TidStore.
*/
@@ -564,6 +615,14 @@ TidStoreGetHandle(TidStore *ts)
return (dsa_pointer) shared_ts_get_handle(ts->tree.shared);
}
+dsa_pointer
+TidStoreGetSharedIterHandle(TidStoreIter *iter)
+{
+ Assert(TidStoreIsShared(iter->ts));
+
+ return (dsa_pointer) shared_ts_get_iter_handle(iter->tree_iter.shared);
+}
+
/*
* Given a TidStoreIterResult returned by TidStoreIterateNext(), extract the
* offset numbers. Returns the number of offsets filled in, if <=
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
index d95cabd7b5..0c79a101fd 100644
--- a/src/include/access/tidstore.h
+++ b/src/include/access/tidstore.h
@@ -37,6 +37,9 @@ extern void TidStoreDetach(TidStore *ts);
extern void TidStoreLockExclusive(TidStore *ts);
extern void TidStoreLockShare(TidStore *ts);
extern void TidStoreUnlock(TidStore *ts);
+extern TidStoreIter *TidStoreBeginIterateShared(TidStore *ts);
+extern TidStoreIter *TidStoreAttachIterateShared(TidStore *ts, dsa_pointer iter_handle);
+extern dsa_pointer TidStoreGetSharedIterHandle(TidStoreIter *iter);
extern void TidStoreDestroy(TidStore *ts);
extern void TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
int num_offsets);
--
2.43.5
v3-0001-Support-parallel-heap-scan-during-lazy-vacuum.patch
From a8c8a2bbf943b157eb6f0e754cb9aaa432e5bce3 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 1 Jul 2024 15:17:46 +0900
Subject: [PATCH v3 1/4] Support parallel heap scan during lazy vacuum.
Commit 40d964ec99 allowed vacuum command to process indexes in
parallel. This change extends the parallel vacuum to support parallel
heap scan during lazy vacuum.
---
src/backend/access/heap/heapam_handler.c | 6 +
src/backend/access/heap/vacuumlazy.c | 1135 ++++++++++++++++++----
src/backend/commands/vacuumparallel.c | 311 +++++-
src/backend/storage/ipc/procarray.c | 9 -
src/include/access/heapam.h | 8 +
src/include/access/tableam.h | 87 ++
src/include/commands/vacuum.h | 8 +-
src/include/utils/snapmgr.h | 14 +-
8 files changed, 1313 insertions(+), 265 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 8c59b77b64..c8602f4d30 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2625,6 +2625,12 @@ static const TableAmRoutine heapam_methods = {
.relation_copy_data = heapam_relation_copy_data,
.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
.relation_vacuum = heap_vacuum_rel,
+
+ .parallel_vacuum_compute_workers = heap_parallel_vacuum_compute_workers,
+ .parallel_vacuum_estimate = heap_parallel_vacuum_estimate,
+ .parallel_vacuum_initialize = heap_parallel_vacuum_initialize,
+ .parallel_vacuum_scan_worker = heap_parallel_vacuum_scan_worker,
+
.scan_analyze_next_block = heapam_scan_analyze_next_block,
.scan_analyze_next_tuple = heapam_scan_analyze_next_tuple,
.index_build_range_scan = heapam_index_build_range_scan,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index d82aa3d489..fd6c054901 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -49,6 +49,7 @@
#include "common/int.h"
#include "executor/instrument.h"
#include "miscadmin.h"
+#include "optimizer/paths.h"
#include "pgstat.h"
#include "portability/instr_time.h"
#include "postmaster/autovacuum.h"
@@ -117,10 +118,24 @@
#define PREFETCH_SIZE ((BlockNumber) 32)
/*
- * Macro to check if we are in a parallel vacuum. If true, we are in the
- * parallel mode and the DSM segment is initialized.
+ * DSM keys for heap parallel vacuum scan. Unlike other parallel execution code,
+ * we don't need to worry about DSM keys conflicting with plan_node_id, but we
+ * need to avoid conflicting with DSM keys used in vacuumparallel.c.
+ */
+#define LV_PARALLEL_SCAN_SHARED 0xFFFF0001
+#define LV_PARALLEL_SCAN_DESC 0xFFFF0002
+#define LV_PARALLEL_SCAN_DESC_WORKER 0xFFFF0003
+
+/*
+ * Macros to check if we are in parallel heap vacuuming, parallel index vacuuming,
+ * or both. If ParallelVacuumIsActive() is true, we are in parallel mode, meaning
+ * that the dead item TIDs are stored in a shared memory area.
*/
#define ParallelVacuumIsActive(vacrel) ((vacrel)->pvs != NULL)
+#define ParallelIndexVacuumIsActive(vacrel) \
+ (ParallelVacuumIsActive(vacrel) && parallel_vacuum_get_nworkers_index((vacrel)->pvs) > 0)
+#define ParallelHeapVacuumIsActive(vacrel) \
+ (ParallelVacuumIsActive(vacrel) && parallel_vacuum_get_nworkers_table((vacrel)->pvs) > 0)
/* Phases of vacuum during which we report error context. */
typedef enum
@@ -133,6 +148,108 @@ typedef enum
VACUUM_ERRCB_PHASE_TRUNCATE,
} VacErrPhase;
+/*
+ * Relation statistics collected during heap scanning that need to be shared
+ * among parallel vacuum workers.
+ */
+typedef struct LVRelScanStats
+{
+ BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
+ BlockNumber removed_pages; /* # pages removed by relation truncation */
+ BlockNumber frozen_pages; /* # pages with newly frozen tuples */
+ BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
+ BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
+ BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
+
+ /* Counters that follow are only for scanned_pages */
+ int64 tuples_deleted; /* # deleted from table */
+ int64 tuples_frozen; /* # newly frozen */
+ int64 lpdead_items; /* # deleted from indexes */
+ int64 live_tuples; /* # live tuples remaining */
+ int64 recently_dead_tuples; /* # dead, but not yet removable */
+ int64 missed_dead_tuples; /* # removable, but not removed */
+
+ /* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid. */
+ TransactionId NewRelfrozenXid;
+ MultiXactId NewRelminMxid;
+ bool skippedallvis;
+} LVRelScanStats;
+
+/*
+ * Struct for information that needs to be shared among parallel vacuum workers
+ */
+typedef struct PHVShared
+{
+ bool aggressive;
+ bool skipwithvm;
+
+ /* The initial values shared by the leader process */
+ TransactionId NewRelfrozenXid;
+ MultiXactId NewRelminMxid;
+ bool skippedallvis;
+
+ /* VACUUM operation's cutoffs for freezing and pruning */
+ struct VacuumCutoffs cutoffs;
+ GlobalVisState vistest;
+
+ /* per-worker scan stats for parallel heap vacuum scan */
+ LVRelScanStats worker_scan_stats[FLEXIBLE_ARRAY_MEMBER];
+} PHVShared;
+#define SizeOfPHVShared (offsetof(PHVShared, worker_scan_stats))
+
+/* Per-worker scan state for parallel heap vacuum scan */
+typedef struct PHVScanWorkerState
+{
+ bool initialized;
+
+ /* per-worker parallel table scan state */
+ ParallelBlockTableScanWorkerData state;
+
+ /*
+ * True if a parallel vacuum scan worker allocated blocks in its scan state
+ * but might not have scanned all of them. The leader process will take over
+ * scanning these remaining blocks.
+ */
+ bool maybe_have_blocks;
+
+ /* current block number being processed */
+ pg_atomic_uint32 cur_blkno;
+} PHVScanWorkerState;
+
+/* Struct for parallel heap vacuum */
+typedef struct PHVState
+{
+ /* Parallel scan description shared among parallel workers */
+ ParallelBlockTableScanDesc pscandesc;
+
+ /* Shared information */
+ PHVShared *shared;
+
+ /*
+ * Points to all per-worker scan state array stored on DSM area.
+ *
+ * During parallel heap scan, each worker allocates some chunks of blocks
+ * to scan in its scan state, and could exit while leaving some chunks
+ * un-scanned if the size of dead_items TIDs is close to overrunning the
+ * available space. We store scan states on the shared memory area so that
+ * workers can resume heap scans from the previous point.
+ */
+ PHVScanWorkerState *scanstates;
+
+ /* Assigned per-worker scan state */
+ PHVScanWorkerState *myscanstate;
+
+ /*
+ * All blocks up to this value have been scanned, i.e. the minimum of cur_blkno
+ * among all PHVScanWorkerState. It's updated by
+ * parallel_heap_vacuum_compute_min_blkno().
+ */
+ BlockNumber min_blkno;
+
+ /* The number of workers launched for parallel heap vacuum */
+ int nworkers_launched;
+} PHVState;
+
typedef struct LVRelState
{
/* Target heap relation and its indexes */
@@ -144,6 +261,9 @@ typedef struct LVRelState
BufferAccessStrategy bstrategy;
ParallelVacuumState *pvs;
+ /* Parallel heap vacuum state and sizes for each struct */
+ PHVState *phvstate;
+
/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
bool aggressive;
/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
@@ -159,10 +279,6 @@ typedef struct LVRelState
/* VACUUM operation's cutoffs for freezing and pruning */
struct VacuumCutoffs cutoffs;
GlobalVisState *vistest;
- /* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
- TransactionId NewRelfrozenXid;
- MultiXactId NewRelminMxid;
- bool skippedallvis;
/* Error reporting state */
char *dbname;
@@ -188,12 +304,10 @@ typedef struct LVRelState
VacDeadItemsInfo *dead_items_info;
BlockNumber rel_pages; /* total number of pages */
- BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
- BlockNumber removed_pages; /* # pages removed by relation truncation */
- BlockNumber frozen_pages; /* # pages with newly frozen tuples */
- BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
- BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
- BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
+ BlockNumber next_fsm_block_to_vacuum;
+
+ /* Statistics collected during heap scan */
+ LVRelScanStats *scan_stats;
/* Statistics output by us, for table */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -203,13 +317,6 @@ typedef struct LVRelState
/* Instrumentation counters */
int num_index_scans;
- /* Counters that follow are only for scanned_pages */
- int64 tuples_deleted; /* # deleted from table */
- int64 tuples_frozen; /* # newly frozen */
- int64 lpdead_items; /* # deleted from indexes */
- int64 live_tuples; /* # live tuples remaining */
- int64 recently_dead_tuples; /* # dead, but not yet removable */
- int64 missed_dead_tuples; /* # removable, but not removed */
/* State maintained by heap_vac_scan_next_block() */
BlockNumber current_block; /* last block returned */
@@ -229,6 +336,7 @@ typedef struct LVSavedErrInfo
/* non-export function prototypes */
static void lazy_scan_heap(LVRelState *vacrel);
+static bool do_lazy_scan_heap(LVRelState *vacrel);
static bool heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
bool *all_visible_according_to_vm);
static void find_next_unskippable_block(LVRelState *vacrel, bool *skipsallvis);
@@ -271,6 +379,12 @@ static void dead_items_cleanup(LVRelState *vacrel);
static bool heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
TransactionId *visibility_cutoff_xid, bool *all_frozen);
static void update_relstats_all_indexes(LVRelState *vacrel);
+
+static void do_parallel_lazy_scan_heap(LVRelState *vacrel);
+static void parallel_heap_vacuum_compute_min_blkno(LVRelState *vacrel);
+static void parallel_heap_vacuum_gather_scan_stats(LVRelState *vacrel);
+static void parallel_heap_complete_unfinised_scan(LVRelState *vacrel);
+
static void vacuum_error_callback(void *arg);
static void update_vacuum_error_info(LVRelState *vacrel,
LVSavedErrInfo *saved_vacrel,
@@ -296,6 +410,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
BufferAccessStrategy bstrategy)
{
LVRelState *vacrel;
+ LVRelScanStats *scan_stats;
bool verbose,
instrument,
skipwithvm,
@@ -406,14 +521,28 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
Assert(params->index_cleanup == VACOPTVALUE_AUTO);
}
+ vacrel->next_fsm_block_to_vacuum = 0;
+
/* Initialize page counters explicitly (be tidy) */
- vacrel->scanned_pages = 0;
- vacrel->removed_pages = 0;
- vacrel->frozen_pages = 0;
- vacrel->lpdead_item_pages = 0;
- vacrel->missed_dead_pages = 0;
- vacrel->nonempty_pages = 0;
- /* dead_items_alloc allocates vacrel->dead_items later on */
+ scan_stats = palloc(sizeof(LVRelScanStats));
+ scan_stats->scanned_pages = 0;
+ scan_stats->removed_pages = 0;
+ scan_stats->frozen_pages = 0;
+ scan_stats->lpdead_item_pages = 0;
+ scan_stats->missed_dead_pages = 0;
+ scan_stats->nonempty_pages = 0;
+
+ /* Initialize remaining counters (be tidy) */
+ scan_stats->tuples_deleted = 0;
+ scan_stats->tuples_frozen = 0;
+ scan_stats->lpdead_items = 0;
+ scan_stats->live_tuples = 0;
+ scan_stats->recently_dead_tuples = 0;
+ scan_stats->missed_dead_tuples = 0;
+
+ vacrel->scan_stats = scan_stats;
+
+ vacrel->num_index_scans = 0;
/* Allocate/initialize output statistics state */
vacrel->new_rel_tuples = 0;
@@ -421,14 +550,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->indstats = (IndexBulkDeleteResult **)
palloc0(vacrel->nindexes * sizeof(IndexBulkDeleteResult *));
- /* Initialize remaining counters (be tidy) */
- vacrel->num_index_scans = 0;
- vacrel->tuples_deleted = 0;
- vacrel->tuples_frozen = 0;
- vacrel->lpdead_items = 0;
- vacrel->live_tuples = 0;
- vacrel->recently_dead_tuples = 0;
- vacrel->missed_dead_tuples = 0;
+ /* dead_items_alloc allocates vacrel->dead_items later on */
/*
* Get cutoffs that determine which deleted tuples are considered DEAD,
@@ -450,9 +572,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
vacrel->vistest = GlobalVisTestFor(rel);
/* Initialize state used to track oldest extant XID/MXID */
- vacrel->NewRelfrozenXid = vacrel->cutoffs.OldestXmin;
- vacrel->NewRelminMxid = vacrel->cutoffs.OldestMxact;
- vacrel->skippedallvis = false;
+ vacrel->scan_stats->NewRelfrozenXid = vacrel->cutoffs.OldestXmin;
+ vacrel->scan_stats->NewRelminMxid = vacrel->cutoffs.OldestMxact;
+ vacrel->scan_stats->skippedallvis = false;
skipwithvm = true;
if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
{
@@ -533,15 +655,15 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* value >= FreezeLimit, and relminmxid to a value >= MultiXactCutoff.
* Non-aggressive VACUUMs may advance them by any amount, or not at all.
*/
- Assert(vacrel->NewRelfrozenXid == vacrel->cutoffs.OldestXmin ||
+ Assert(vacrel->scan_stats->NewRelfrozenXid == vacrel->cutoffs.OldestXmin ||
TransactionIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.FreezeLimit :
vacrel->cutoffs.relfrozenxid,
- vacrel->NewRelfrozenXid));
- Assert(vacrel->NewRelminMxid == vacrel->cutoffs.OldestMxact ||
+ vacrel->scan_stats->NewRelfrozenXid));
+ Assert(vacrel->scan_stats->NewRelminMxid == vacrel->cutoffs.OldestMxact ||
MultiXactIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.MultiXactCutoff :
vacrel->cutoffs.relminmxid,
- vacrel->NewRelminMxid));
- if (vacrel->skippedallvis)
+ vacrel->scan_stats->NewRelminMxid));
+ if (vacrel->scan_stats->skippedallvis)
{
/*
* Must keep original relfrozenxid in a non-aggressive VACUUM that
@@ -549,8 +671,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* values will have missed unfrozen XIDs from the pages we skipped.
*/
Assert(!vacrel->aggressive);
- vacrel->NewRelfrozenXid = InvalidTransactionId;
- vacrel->NewRelminMxid = InvalidMultiXactId;
+ vacrel->scan_stats->NewRelfrozenXid = InvalidTransactionId;
+ vacrel->scan_stats->NewRelminMxid = InvalidMultiXactId;
}
/*
@@ -571,7 +693,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
*/
vac_update_relstats(rel, new_rel_pages, vacrel->new_live_tuples,
new_rel_allvisible, vacrel->nindexes > 0,
- vacrel->NewRelfrozenXid, vacrel->NewRelminMxid,
+ vacrel->scan_stats->NewRelfrozenXid, vacrel->scan_stats->NewRelminMxid,
&frozenxid_updated, &minmulti_updated, false);
/*
@@ -587,8 +709,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(rel),
rel->rd_rel->relisshared,
Max(vacrel->new_live_tuples, 0),
- vacrel->recently_dead_tuples +
- vacrel->missed_dead_tuples);
+ vacrel->scan_stats->recently_dead_tuples +
+ vacrel->scan_stats->missed_dead_tuples);
pgstat_progress_end_command();
if (instrument)
@@ -661,21 +783,21 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->relname,
vacrel->num_index_scans);
appendStringInfo(&buf, _("pages: %u removed, %u remain, %u scanned (%.2f%% of total)\n"),
- vacrel->removed_pages,
+ vacrel->scan_stats->removed_pages,
new_rel_pages,
- vacrel->scanned_pages,
+ vacrel->scan_stats->scanned_pages,
orig_rel_pages == 0 ? 100.0 :
- 100.0 * vacrel->scanned_pages / orig_rel_pages);
+ 100.0 * vacrel->scan_stats->scanned_pages / orig_rel_pages);
appendStringInfo(&buf,
_("tuples: %lld removed, %lld remain, %lld are dead but not yet removable\n"),
- (long long) vacrel->tuples_deleted,
+ (long long) vacrel->scan_stats->tuples_deleted,
(long long) vacrel->new_rel_tuples,
- (long long) vacrel->recently_dead_tuples);
- if (vacrel->missed_dead_tuples > 0)
+ (long long) vacrel->scan_stats->recently_dead_tuples);
+ if (vacrel->scan_stats->missed_dead_tuples > 0)
appendStringInfo(&buf,
_("tuples missed: %lld dead from %u pages not removed due to cleanup lock contention\n"),
- (long long) vacrel->missed_dead_tuples,
- vacrel->missed_dead_pages);
+ (long long) vacrel->scan_stats->missed_dead_tuples,
+ vacrel->scan_stats->missed_dead_pages);
diff = (int32) (ReadNextTransactionId() -
vacrel->cutoffs.OldestXmin);
appendStringInfo(&buf,
@@ -683,25 +805,25 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->cutoffs.OldestXmin, diff);
if (frozenxid_updated)
{
- diff = (int32) (vacrel->NewRelfrozenXid -
+ diff = (int32) (vacrel->scan_stats->NewRelfrozenXid -
vacrel->cutoffs.relfrozenxid);
appendStringInfo(&buf,
_("new relfrozenxid: %u, which is %d XIDs ahead of previous value\n"),
- vacrel->NewRelfrozenXid, diff);
+ vacrel->scan_stats->NewRelfrozenXid, diff);
}
if (minmulti_updated)
{
- diff = (int32) (vacrel->NewRelminMxid -
+ diff = (int32) (vacrel->scan_stats->NewRelminMxid -
vacrel->cutoffs.relminmxid);
appendStringInfo(&buf,
_("new relminmxid: %u, which is %d MXIDs ahead of previous value\n"),
- vacrel->NewRelminMxid, diff);
+ vacrel->scan_stats->NewRelminMxid, diff);
}
appendStringInfo(&buf, _("frozen: %u pages from table (%.2f%% of total) had %lld tuples frozen\n"),
- vacrel->frozen_pages,
+ vacrel->scan_stats->frozen_pages,
orig_rel_pages == 0 ? 100.0 :
- 100.0 * vacrel->frozen_pages / orig_rel_pages,
- (long long) vacrel->tuples_frozen);
+ 100.0 * vacrel->scan_stats->frozen_pages / orig_rel_pages,
+ (long long) vacrel->scan_stats->tuples_frozen);
if (vacrel->do_index_vacuuming)
{
if (vacrel->nindexes == 0 || vacrel->num_index_scans == 0)
@@ -721,10 +843,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
msgfmt = _("%u pages from table (%.2f%% of total) have %lld dead item identifiers\n");
}
appendStringInfo(&buf, msgfmt,
- vacrel->lpdead_item_pages,
+ vacrel->scan_stats->lpdead_item_pages,
orig_rel_pages == 0 ? 100.0 :
- 100.0 * vacrel->lpdead_item_pages / orig_rel_pages,
- (long long) vacrel->lpdead_items);
+ 100.0 * vacrel->scan_stats->lpdead_item_pages / orig_rel_pages,
+ (long long) vacrel->scan_stats->lpdead_items);
for (int i = 0; i < vacrel->nindexes; i++)
{
IndexBulkDeleteResult *istat = vacrel->indstats[i];
@@ -825,14 +947,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
static void
lazy_scan_heap(LVRelState *vacrel)
{
- BlockNumber rel_pages = vacrel->rel_pages,
- blkno,
- next_fsm_block_to_vacuum = 0;
- bool all_visible_according_to_vm;
-
- TidStore *dead_items = vacrel->dead_items;
+ BlockNumber rel_pages = vacrel->rel_pages;
VacDeadItemsInfo *dead_items_info = vacrel->dead_items_info;
- Buffer vmbuffer = InvalidBuffer;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -852,6 +968,72 @@ lazy_scan_heap(LVRelState *vacrel)
vacrel->next_unskippable_allvis = false;
vacrel->next_unskippable_vmbuffer = InvalidBuffer;
+ /*
+ * Do the actual work. If parallel heap vacuum is active, we scan and
+ * vacuum heap with parallel workers.
+ */
+ if (ParallelHeapVacuumIsActive(vacrel))
+ do_parallel_lazy_scan_heap(vacrel);
+ else
+ do_lazy_scan_heap(vacrel);
+
+ /* report that everything is now scanned */
+ pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, rel_pages);
+
+ /* now we can compute the new value for pg_class.reltuples */
+ vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
+ vacrel->scan_stats->scanned_pages,
+ vacrel->scan_stats->live_tuples);
+
+ /*
+ * Also compute the total number of surviving heap entries. In the
+ * (unlikely) scenario that new_live_tuples is -1, take it as zero.
+ */
+ vacrel->new_rel_tuples =
+ Max(vacrel->new_live_tuples, 0) + vacrel->scan_stats->recently_dead_tuples +
+ vacrel->scan_stats->missed_dead_tuples;
+
+ /*
+ * Do index vacuuming (call each index's ambulkdelete routine), then do
+ * related heap vacuuming
+ */
+ if (dead_items_info->num_items > 0)
+ lazy_vacuum(vacrel);
+
+ /*
+ * Vacuum the remainder of the Free Space Map. We must do this whether or
+ * not there were indexes, and whether or not we bypassed index vacuuming.
+ */
+ if (rel_pages > vacrel->next_fsm_block_to_vacuum)
+ FreeSpaceMapVacuumRange(vacrel->rel, vacrel->next_fsm_block_to_vacuum,
+ rel_pages);
+
+ /* report all blocks vacuumed */
+ pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, rel_pages);
+
+ /* Do final index cleanup (call each index's amvacuumcleanup routine) */
+ if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
+ lazy_cleanup_all_indexes(vacrel);
+}
+
+/*
+ * Workhorse for lazy_scan_heap().
+ *
+ * Return true if we processed all blocks, or false if we exited before completing
+ * the heap scan because the space for dead item TIDs filled up. In the serial heap
+ * scan case, this function always returns true. In a parallel heap vacuum scan, this
+ * function is called by both worker processes and the leader process, and could return false.
+ */
+static bool
+do_lazy_scan_heap(LVRelState *vacrel)
+{
+ bool all_visible_according_to_vm;
+ TidStore *dead_items = vacrel->dead_items;
+ VacDeadItemsInfo *dead_items_info = vacrel->dead_items_info;
+ BlockNumber blkno;
+ Buffer vmbuffer = InvalidBuffer;
+ bool scan_done = true;
+
while (heap_vac_scan_next_block(vacrel, &blkno, &all_visible_according_to_vm))
{
Buffer buf;
@@ -859,13 +1041,20 @@ lazy_scan_heap(LVRelState *vacrel)
bool has_lpdead_items;
bool got_cleanup_lock = false;
- vacrel->scanned_pages++;
+ vacrel->scan_stats->scanned_pages++;
/* Report as block scanned, update error traceback information */
pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
blkno, InvalidOffsetNumber);
+ /*
+ * If parallel vacuum scan is enabled, advertise the current block
+ * number
+ */
+ if (ParallelHeapVacuumIsActive(vacrel))
+ pg_atomic_write_u32(&(vacrel->phvstate->myscanstate->cur_blkno), (uint32) blkno);
+
vacuum_delay_point();
/*
@@ -877,46 +1066,10 @@ lazy_scan_heap(LVRelState *vacrel)
* one-pass strategy, and the two-pass strategy with the index_cleanup
* param set to 'off'.
*/
- if (vacrel->scanned_pages % FAILSAFE_EVERY_PAGES == 0)
+ if (!IsParallelWorker() &&
+ vacrel->scan_stats->scanned_pages % FAILSAFE_EVERY_PAGES == 0)
lazy_check_wraparound_failsafe(vacrel);
- /*
- * Consider if we definitely have enough space to process TIDs on page
- * already. If we are close to overrunning the available space for
- * dead_items TIDs, pause and do a cycle of vacuuming before we tackle
- * this page.
- */
- if (TidStoreMemoryUsage(dead_items) > dead_items_info->max_bytes)
- {
- /*
- * Before beginning index vacuuming, we release any pin we may
- * hold on the visibility map page. This isn't necessary for
- * correctness, but we do it anyway to avoid holding the pin
- * across a lengthy, unrelated operation.
- */
- if (BufferIsValid(vmbuffer))
- {
- ReleaseBuffer(vmbuffer);
- vmbuffer = InvalidBuffer;
- }
-
- /* Perform a round of index and heap vacuuming */
- vacrel->consider_bypass_optimization = false;
- lazy_vacuum(vacrel);
-
- /*
- * Vacuum the Free Space Map to make newly-freed space visible on
- * upper-level FSM pages. Note we have not yet processed blkno.
- */
- FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
- blkno);
- next_fsm_block_to_vacuum = blkno;
-
- /* Report that we are once again scanning the heap */
- pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
- PROGRESS_VACUUM_PHASE_SCAN_HEAP);
- }
-
/*
* Pin the visibility map page in case we need to mark the page
* all-visible. In most cases this will be very cheap, because we'll
@@ -1005,9 +1158,10 @@ lazy_scan_heap(LVRelState *vacrel)
* revisit this page. Since updating the FSM is desirable but not
* absolutely required, that's OK.
*/
- if (vacrel->nindexes == 0
- || !vacrel->do_index_vacuuming
- || !has_lpdead_items)
+ if (!IsParallelWorker() &&
+ (vacrel->nindexes == 0
+ || !vacrel->do_index_vacuuming
+ || !has_lpdead_items))
{
Size freespace = PageGetHeapFreeSpace(page);
@@ -1021,57 +1175,172 @@ lazy_scan_heap(LVRelState *vacrel)
* held the cleanup lock and lazy_scan_prune() was called.
*/
if (got_cleanup_lock && vacrel->nindexes == 0 && has_lpdead_items &&
- blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
+ blkno - vacrel->next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
{
- FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
- blkno);
- next_fsm_block_to_vacuum = blkno;
+ BlockNumber fsm_vac_up_to;
+
+ /*
+ * If parallel heap vacuum scan is active, compute the minimum
+ * block number we scanned so far.
+ */
+ if (ParallelHeapVacuumIsActive(vacrel))
+ {
+ parallel_heap_vacuum_compute_min_blkno(vacrel);
+ fsm_vac_up_to = vacrel->phvstate->min_blkno;
+ }
+ else
+ {
+ /* blkno is already processed */
+ fsm_vac_up_to = blkno + 1;
+ }
+
+ FreeSpaceMapVacuumRange(vacrel->rel, vacrel->next_fsm_block_to_vacuum,
+ fsm_vac_up_to);
+ vacrel->next_fsm_block_to_vacuum = fsm_vac_up_to;
}
}
else
UnlockReleaseBuffer(buf);
+
+ /*
+ * Consider if we definitely have enough space to process TIDs on page
+ * already. If we are close to overrunning the available space for
+ * dead_items TIDs, pause and do a cycle of vacuuming before we tackle
+ * this page.
+ */
+ if (TidStoreMemoryUsage(dead_items) > dead_items_info->max_bytes)
+ {
+ /*
+ * Before beginning index vacuuming, we release any pin we may
+ * hold on the visibility map page. This isn't necessary for
+ * correctness, but we do it anyway to avoid holding the pin
+ * across a lengthy, unrelated operation.
+ */
+ if (BufferIsValid(vmbuffer))
+ {
+ ReleaseBuffer(vmbuffer);
+ vmbuffer = InvalidBuffer;
+ }
+
+ if (ParallelHeapVacuumIsActive(vacrel))
+ {
+ /* Remember we might have some unprocessed blocks */
+ scan_done = false;
+
+ /*
+ * Pause the heap scan without invoking index and heap
+ * vacuuming. The leader process also skips FSM vacuum since
+ * some blocks before blkno might not have been processed yet. The
+ * leader will wait for all workers to finish and perform
+ * index and heap vacuuming, and then perform FSM vacuum.
+ */
+ break;
+ }
+
+ /* Perform a round of index and heap vacuuming */
+ vacrel->consider_bypass_optimization = false;
+ lazy_vacuum(vacrel);
+
+ /*
+ * Vacuum the Free Space Map to make newly-freed space visible on
+ * upper-level FSM pages.
+ */
+ FreeSpaceMapVacuumRange(vacrel->rel, vacrel->next_fsm_block_to_vacuum,
+ blkno + 1);
+ vacrel->next_fsm_block_to_vacuum = blkno + 1;
+
+ /* Report that we are once again scanning the heap */
+ pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+ PROGRESS_VACUUM_PHASE_SCAN_HEAP);
+
+ continue;
+ }
}
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
ReleaseBuffer(vmbuffer);
- /* report that everything is now scanned */
- pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
+ return scan_done;
+}
- /* now we can compute the new value for pg_class.reltuples */
- vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
- vacrel->scanned_pages,
- vacrel->live_tuples);
+/*
+ * A parallel scan variant of heap_vac_scan_next_block.
+ *
+ * In parallel vacuum scan, we don't use the SKIP_PAGES_THRESHOLD optimization.
+ */
+static bool
+heap_vac_scan_next_block_parallel(LVRelState *vacrel, BlockNumber *blkno,
+ bool *all_visible_according_to_vm)
+{
+ PHVState *phvstate = vacrel->phvstate;
+ BlockNumber next_block;
+ Buffer vmbuffer = InvalidBuffer;
+ uint8 mapbits = 0;
- /*
- * Also compute the total number of surviving heap entries. In the
- * (unlikely) scenario that new_live_tuples is -1, take it as zero.
- */
- vacrel->new_rel_tuples =
- Max(vacrel->new_live_tuples, 0) + vacrel->recently_dead_tuples +
- vacrel->missed_dead_tuples;
+ Assert(ParallelHeapVacuumIsActive(vacrel));
- /*
- * Do index vacuuming (call each index's ambulkdelete routine), then do
- * related heap vacuuming
- */
- if (dead_items_info->num_items > 0)
- lazy_vacuum(vacrel);
+ for (;;)
+ {
+ next_block = table_block_parallelscan_nextpage(vacrel->rel,
+ &(phvstate->myscanstate->state),
+ phvstate->pscandesc);
- /*
- * Vacuum the remainder of the Free Space Map. We must do this whether or
- * not there were indexes, and whether or not we bypassed index vacuuming.
- */
- if (blkno > next_fsm_block_to_vacuum)
- FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno);
+ /* Have we reached the end of the table? */
+ if (!BlockNumberIsValid(next_block) || next_block >= vacrel->rel_pages)
+ {
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
- /* report all blocks vacuumed */
- pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
+ *blkno = vacrel->rel_pages;
+ return false;
+ }
- /* Do final index cleanup (call each index's amvacuumcleanup routine) */
- if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
- lazy_cleanup_all_indexes(vacrel);
+ /* We always treat the last block as unsafe to skip */
+ if (next_block == vacrel->rel_pages - 1)
+ break;
+
+ mapbits = visibilitymap_get_status(vacrel->rel, next_block, &vmbuffer);
+
+ /*
+ * A block is unskippable if it is not all visible according to the
+ * visibility map.
+ */
+ if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
+ {
+ Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
+ break;
+ }
+
+ /* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
+ if (!vacrel->skipwithvm)
+ break;
+
+ /*
+ * Aggressive VACUUM caller can't skip pages just because they are
+ * all-visible.
+ */
+ if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
+ {
+
+ if (vacrel->aggressive)
+ break;
+
+ /*
+ * All-visible block is safe to skip in non-aggressive case. But
+ * remember that the final range contains such a block for later.
+ */
+ vacrel->scan_stats->skippedallvis = true;
+ }
+ }
+
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
+
+ *blkno = next_block;
+ *all_visible_according_to_vm = (mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0;
+
+ return true;
}
/*
@@ -1098,6 +1367,9 @@ heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
{
BlockNumber next_block;
+ if (ParallelHeapVacuumIsActive(vacrel))
+ return heap_vac_scan_next_block_parallel(vacrel, blkno, all_visible_according_to_vm);
+
/* relies on InvalidBlockNumber + 1 overflowing to 0 on first call */
next_block = vacrel->current_block + 1;
@@ -1147,7 +1419,7 @@ heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
{
next_block = vacrel->next_unskippable_block;
if (skipsallvis)
- vacrel->skippedallvis = true;
+ vacrel->scan_stats->skippedallvis = true;
}
}
@@ -1220,11 +1492,12 @@ find_next_unskippable_block(LVRelState *vacrel, bool *skipsallvis)
/*
* Caller must scan the last page to determine whether it has tuples
- * (caller must have the opportunity to set vacrel->nonempty_pages).
- * This rule avoids having lazy_truncate_heap() take access-exclusive
- * lock on rel to attempt a truncation that fails anyway, just because
- * there are tuples on the last page (it is likely that there will be
- * tuples on other nearby pages as well, but those can be skipped).
+ * (caller must have the opportunity to set
+ * vacrel->scan_stats->nonempty_pages). This rule avoids having
+ * lazy_truncate_heap() take access-exclusive lock on rel to attempt a
+ * truncation that fails anyway, just because there are tuples on the
+ * last page (it is likely that there will be tuples on other nearby
+ * pages as well, but those can be skipped).
*
* Implement this by always treating the last block as unsafe to skip.
*/
@@ -1449,10 +1722,10 @@ lazy_scan_prune(LVRelState *vacrel,
heap_page_prune_and_freeze(rel, buf, vacrel->vistest, prune_options,
&vacrel->cutoffs, &presult, PRUNE_VACUUM_SCAN,
&vacrel->offnum,
- &vacrel->NewRelfrozenXid, &vacrel->NewRelminMxid);
+ &vacrel->scan_stats->NewRelfrozenXid, &vacrel->scan_stats->NewRelminMxid);
- Assert(MultiXactIdIsValid(vacrel->NewRelminMxid));
- Assert(TransactionIdIsValid(vacrel->NewRelfrozenXid));
+ Assert(MultiXactIdIsValid(vacrel->scan_stats->NewRelminMxid));
+ Assert(TransactionIdIsValid(vacrel->scan_stats->NewRelfrozenXid));
if (presult.nfrozen > 0)
{
@@ -1461,7 +1734,7 @@ lazy_scan_prune(LVRelState *vacrel,
* nfrozen == 0, since it only counts pages with newly frozen tuples
* (don't confuse that with pages newly set all-frozen in VM).
*/
- vacrel->frozen_pages++;
+ vacrel->scan_stats->frozen_pages++;
}
/*
@@ -1496,7 +1769,7 @@ lazy_scan_prune(LVRelState *vacrel,
*/
if (presult.lpdead_items > 0)
{
- vacrel->lpdead_item_pages++;
+ vacrel->scan_stats->lpdead_item_pages++;
/*
* deadoffsets are collected incrementally in
@@ -1511,15 +1784,15 @@ lazy_scan_prune(LVRelState *vacrel,
}
/* Finally, add page-local counts to whole-VACUUM counts */
- vacrel->tuples_deleted += presult.ndeleted;
- vacrel->tuples_frozen += presult.nfrozen;
- vacrel->lpdead_items += presult.lpdead_items;
- vacrel->live_tuples += presult.live_tuples;
- vacrel->recently_dead_tuples += presult.recently_dead_tuples;
+ vacrel->scan_stats->tuples_deleted += presult.ndeleted;
+ vacrel->scan_stats->tuples_frozen += presult.nfrozen;
+ vacrel->scan_stats->lpdead_items += presult.lpdead_items;
+ vacrel->scan_stats->live_tuples += presult.live_tuples;
+ vacrel->scan_stats->recently_dead_tuples += presult.recently_dead_tuples;
/* Can't truncate this page */
if (presult.hastup)
- vacrel->nonempty_pages = blkno + 1;
+ vacrel->scan_stats->nonempty_pages = blkno + 1;
/* Did we find LP_DEAD items? */
*has_lpdead_items = (presult.lpdead_items > 0);
@@ -1669,8 +1942,8 @@ lazy_scan_noprune(LVRelState *vacrel,
missed_dead_tuples;
bool hastup;
HeapTupleHeader tupleheader;
- TransactionId NoFreezePageRelfrozenXid = vacrel->NewRelfrozenXid;
- MultiXactId NoFreezePageRelminMxid = vacrel->NewRelminMxid;
+ TransactionId NoFreezePageRelfrozenXid = vacrel->scan_stats->NewRelfrozenXid;
+ MultiXactId NoFreezePageRelminMxid = vacrel->scan_stats->NewRelminMxid;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1797,8 +2070,8 @@ lazy_scan_noprune(LVRelState *vacrel,
* this particular page until the next VACUUM. Remember its details now.
* (lazy_scan_prune expects a clean slate, so we have to do this last.)
*/
- vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
- vacrel->NewRelminMxid = NoFreezePageRelminMxid;
+ vacrel->scan_stats->NewRelfrozenXid = NoFreezePageRelfrozenXid;
+ vacrel->scan_stats->NewRelminMxid = NoFreezePageRelminMxid;
/* Save any LP_DEAD items found on the page in dead_items */
if (vacrel->nindexes == 0)
@@ -1825,25 +2098,25 @@ lazy_scan_noprune(LVRelState *vacrel,
* indexes will be deleted during index vacuuming (and then marked
* LP_UNUSED in the heap)
*/
- vacrel->lpdead_item_pages++;
+ vacrel->scan_stats->lpdead_item_pages++;
dead_items_add(vacrel, blkno, deadoffsets, lpdead_items);
- vacrel->lpdead_items += lpdead_items;
+ vacrel->scan_stats->lpdead_items += lpdead_items;
}
/*
* Finally, add relevant page-local counts to whole-VACUUM counts
*/
- vacrel->live_tuples += live_tuples;
- vacrel->recently_dead_tuples += recently_dead_tuples;
- vacrel->missed_dead_tuples += missed_dead_tuples;
+ vacrel->scan_stats->live_tuples += live_tuples;
+ vacrel->scan_stats->recently_dead_tuples += recently_dead_tuples;
+ vacrel->scan_stats->missed_dead_tuples += missed_dead_tuples;
if (missed_dead_tuples > 0)
- vacrel->missed_dead_pages++;
+ vacrel->scan_stats->missed_dead_pages++;
/* Can't truncate this page */
if (hastup)
- vacrel->nonempty_pages = blkno + 1;
+ vacrel->scan_stats->nonempty_pages = blkno + 1;
/* Did we find LP_DEAD items? */
*has_lpdead_items = (lpdead_items > 0);
@@ -1872,7 +2145,7 @@ lazy_vacuum(LVRelState *vacrel)
/* Should not end up here with no indexes */
Assert(vacrel->nindexes > 0);
- Assert(vacrel->lpdead_item_pages > 0);
+ Assert(vacrel->scan_stats->lpdead_item_pages > 0);
if (!vacrel->do_index_vacuuming)
{
@@ -1906,7 +2179,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items_info->num_items);
+ Assert(vacrel->scan_stats->lpdead_items == vacrel->dead_items_info->num_items);
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -1933,7 +2206,7 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
+ bypass = (vacrel->scan_stats->lpdead_item_pages < threshold &&
(TidStoreMemoryUsage(vacrel->dead_items) < (32L * 1024L * 1024L)));
}
@@ -2026,7 +2299,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
progress_start_val[1] = vacrel->nindexes;
pgstat_progress_update_multi_param(2, progress_start_index, progress_start_val);
- if (!ParallelVacuumIsActive(vacrel))
+ if (!ParallelIndexVacuumIsActive(vacrel))
{
for (int idx = 0; idx < vacrel->nindexes; idx++)
{
@@ -2071,7 +2344,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items_info->num_items == vacrel->lpdead_items);
+ vacrel->dead_items_info->num_items == vacrel->scan_stats->lpdead_items);
Assert(allindexes || VacuumFailsafeActive);
/*
@@ -2180,8 +2453,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* the second heap pass. No more, no less.
*/
Assert(vacrel->num_index_scans > 1 ||
- (vacrel->dead_items_info->num_items == vacrel->lpdead_items &&
- vacuumed_pages == vacrel->lpdead_item_pages));
+ (vacrel->dead_items_info->num_items == vacrel->scan_stats->lpdead_items &&
+ vacuumed_pages == vacrel->scan_stats->lpdead_item_pages));
ereport(DEBUG2,
(errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
@@ -2334,7 +2607,7 @@ lazy_check_wraparound_failsafe(LVRelState *vacrel)
vacrel->do_index_cleanup = false;
vacrel->do_rel_truncate = false;
/* Reset the progress counters */
pgstat_progress_update_multi_param(2, progress_index, progress_val);
ereport(WARNING,
@@ -2362,7 +2635,7 @@ static void
lazy_cleanup_all_indexes(LVRelState *vacrel)
{
double reltuples = vacrel->new_rel_tuples;
- bool estimated_count = vacrel->scanned_pages < vacrel->rel_pages;
+ bool estimated_count = vacrel->scan_stats->scanned_pages < vacrel->rel_pages;
const int progress_start_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_INDEXES_TOTAL
@@ -2385,7 +2658,7 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
progress_start_val[1] = vacrel->nindexes;
pgstat_progress_update_multi_param(2, progress_start_index, progress_start_val);
- if (!ParallelVacuumIsActive(vacrel))
+ if (!ParallelIndexVacuumIsActive(vacrel))
{
for (int idx = 0; idx < vacrel->nindexes; idx++)
{
@@ -2409,7 +2682,7 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
estimated_count);
}
/* Reset the progress counters */
pgstat_progress_update_multi_param(2, progress_end_index, progress_end_val);
}
@@ -2543,7 +2816,7 @@ should_attempt_truncation(LVRelState *vacrel)
if (!vacrel->do_rel_truncate || VacuumFailsafeActive)
return false;
- possibly_freeable = vacrel->rel_pages - vacrel->nonempty_pages;
+ possibly_freeable = vacrel->rel_pages - vacrel->scan_stats->nonempty_pages;
if (possibly_freeable > 0 &&
(possibly_freeable >= REL_TRUNCATE_MINIMUM ||
possibly_freeable >= vacrel->rel_pages / REL_TRUNCATE_FRACTION))
@@ -2569,7 +2842,7 @@ lazy_truncate_heap(LVRelState *vacrel)
/* Update error traceback information one last time */
update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_TRUNCATE,
- vacrel->nonempty_pages, InvalidOffsetNumber);
+ vacrel->scan_stats->nonempty_pages, InvalidOffsetNumber);
/*
* Loop until no more truncating can be done.
@@ -2670,7 +2943,7 @@ lazy_truncate_heap(LVRelState *vacrel)
* without also touching reltuples, since the tuple count wasn't
* changed by the truncation.
*/
- vacrel->removed_pages += orig_rel_pages - new_rel_pages;
+ vacrel->scan_stats->removed_pages += orig_rel_pages - new_rel_pages;
vacrel->rel_pages = new_rel_pages;
ereport(vacrel->verbose ? INFO : DEBUG2,
@@ -2678,7 +2951,7 @@ lazy_truncate_heap(LVRelState *vacrel)
vacrel->relname,
orig_rel_pages, new_rel_pages)));
orig_rel_pages = new_rel_pages;
- } while (new_rel_pages > vacrel->nonempty_pages && lock_waiter_detected);
+ } while (new_rel_pages > vacrel->scan_stats->nonempty_pages && lock_waiter_detected);
}
/*
@@ -2706,7 +2979,7 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
StaticAssertStmt((PREFETCH_SIZE & (PREFETCH_SIZE - 1)) == 0,
"prefetch size must be power of 2");
prefetchedUntil = InvalidBlockNumber;
- while (blkno > vacrel->nonempty_pages)
+ while (blkno > vacrel->scan_stats->nonempty_pages)
{
Buffer buf;
Page page;
@@ -2818,7 +3091,7 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
* pages still are; we need not bother to look at the last known-nonempty
* page.
*/
- return vacrel->nonempty_pages;
+ return vacrel->scan_stats->nonempty_pages;
}
/*
@@ -2836,12 +3109,8 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
autovacuum_work_mem != -1 ?
autovacuum_work_mem : maintenance_work_mem;
- /*
- * Initialize state for a parallel vacuum. As of now, only one worker can
- * be used for an index, so we invoke parallelism only if there are at
- * least two indexes on a table.
- */
- if (nworkers >= 0 && vacrel->nindexes > 1 && vacrel->do_index_vacuuming)
+ /* Initialize state for a parallel vacuum */
+ if (nworkers >= 0)
{
/*
* Since parallel workers cannot access data in temporary tables, we
@@ -2859,11 +3128,20 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
vacrel->relname)));
}
else
+ {
+ /*
+ * We initialize parallel heap scan/vacuum, index vacuuming, or both
+ * based on the table size and the number of indexes. Note that since
+ * only one worker can be used per index, we invoke parallelism for
+ * index vacuuming only if there are at least two indexes on the
+ * table.
+ */
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
vac_work_mem,
vacrel->verbose ? INFO : DEBUG2,
- vacrel->bstrategy);
+ vacrel->bstrategy, (void *) vacrel);
+ }
/*
* If parallel mode started, dead_items and dead_items_info spaces are
@@ -2904,9 +3182,19 @@ dead_items_add(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
};
int64 prog_val[2];
+ /*
+ * Protect both dead_items and dead_items_info from concurrent updates in
+ * parallel heap scan cases.
+ */
+ if (ParallelHeapVacuumIsActive(vacrel))
+ TidStoreLockExclusive(dead_items);
+
TidStoreSetBlockOffsets(dead_items, blkno, offsets, num_offsets);
vacrel->dead_items_info->num_items += num_offsets;
+ if (ParallelHeapVacuumIsActive(vacrel))
+ TidStoreUnlock(dead_items);
+
/* update the progress information */
prog_val[0] = vacrel->dead_items_info->num_items;
prog_val[1] = TidStoreMemoryUsage(dead_items);
@@ -3108,6 +3396,453 @@ update_relstats_all_indexes(LVRelState *vacrel)
}
}
+/*
+ * Compute the number of parallel workers for parallel vacuum heap scan.
+ *
+ * The calculation logic is borrowed from compute_parallel_worker().
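+ *
+ * For example, assuming the default min_parallel_table_scan_size of 8MB
+ * (1024 pages with 8kB blocks), this selects one worker once the table
+ * reaches 3072 pages (24MB), two workers at 9216 pages (72MB), three at
+ * 27648 pages (216MB), and so on, tripling the threshold for each
+ * additional worker.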
+ */
+int
+heap_parallel_vacuum_compute_workers(Relation rel, int nrequested)
+{
+ int parallel_workers = 0;
+ int heap_parallel_threshold;
+ int heap_pages;
+
+ if (nrequested == 0)
+ {
+ /*
+ * Select the number of workers based on the log of the size of the
+ * relation. This probably needs to be a good deal more
+ * sophisticated, but we need something here for now. Note that the
+ * upper limit of the min_parallel_table_scan_size GUC is chosen to
+ * prevent overflow here.
+ */
+ heap_parallel_threshold = Max(min_parallel_table_scan_size, 1);
+ heap_pages = RelationGetNumberOfBlocks(rel);
+ while (heap_pages >= (BlockNumber) (heap_parallel_threshold * 3))
+ {
+ parallel_workers++;
+ heap_parallel_threshold *= 3;
+ if (heap_parallel_threshold > INT_MAX / 3)
+ break;
+ }
+ }
+ else
+ parallel_workers = nrequested;
+
+ return parallel_workers;
+}
+
+/* Estimate shared memory sizes required for parallel heap vacuum */
+static inline void
+heap_parallel_estimate_shared_memory_size(Relation rel, int nworkers, Size *pscan_len,
+ Size *shared_len, Size *pscanwork_len)
+{
+ Size size = 0;
+
+ size = add_size(size, SizeOfPHVShared);
+ size = add_size(size, mul_size(sizeof(LVRelScanStats), nworkers));
+ *shared_len = size;
+
+ *pscan_len = table_block_parallelscan_estimate(rel);
+
+ *pscanwork_len = mul_size(sizeof(PHVScanWorkerState), nworkers);
+}
+
+/*
+ * Compute the amount of space we'll need in the parallel heap vacuum
+ * DSM, and inform pcxt->estimator about our needs.
+ *
+ * nworkers is the number of workers for the table vacuum. Note that it could
+ * differ from pcxt->nworkers, which is the maximum of the number of workers
+ * for table vacuum and the number of workers for index vacuum.
+ */
+void
+heap_parallel_vacuum_estimate(Relation rel, ParallelContext *pcxt,
+ int nworkers, void *state)
+{
+ Size pscan_len;
+ Size shared_len;
+ Size pscanwork_len;
+
+ heap_parallel_estimate_shared_memory_size(rel, nworkers, &pscan_len,
+ &shared_len, &pscanwork_len);
+
+ /* space for PHVShared */
+ shm_toc_estimate_chunk(&pcxt->estimator, shared_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+ /* space for ParallelBlockTableScanDesc */
+ shm_toc_estimate_chunk(&pcxt->estimator, pscan_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+ /* space for per-worker scan state, PHVScanWorkerState */
+ shm_toc_estimate_chunk(&pcxt->estimator, pscanwork_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/*
+ * Set up shared memory for parallel heap vacuum.
+ */
+void
+heap_parallel_vacuum_initialize(Relation rel, ParallelContext *pcxt,
+ int nworkers, void *state)
+{
+ LVRelState *vacrel = (LVRelState *) state;
+ PHVState *phvstate;
+ ParallelBlockTableScanDesc pscan;
+ PHVScanWorkerState *pscanwork;
+ PHVShared *shared;
+ Size pscan_len;
+ Size shared_len;
+ Size pscanwork_len;
+
+ phvstate = (PHVState *) palloc(sizeof(PHVState));
+
+ heap_parallel_estimate_shared_memory_size(rel, nworkers, &pscan_len,
+ &shared_len, &pscanwork_len);
+
+ shared = shm_toc_allocate(pcxt->toc, shared_len);
+
+ /* Prepare the shared information */
+
+ MemSet(shared, 0, shared_len);
+ shared->aggressive = vacrel->aggressive;
+ shared->skipwithvm = vacrel->skipwithvm;
+ shared->cutoffs = vacrel->cutoffs;
+ shared->NewRelfrozenXid = vacrel->scan_stats->NewRelfrozenXid;
+ shared->NewRelminMxid = vacrel->scan_stats->NewRelminMxid;
+ shared->skippedallvis = vacrel->scan_stats->skippedallvis;
+
+ /*
+ * XXX: we copy the contents of vistest to the shared area, but in order
+ * to do that, we need to either expose GlobalVisTest or provide functions
+ * to copy the contents of GlobalVisTest somewhere. Currently we do the
+ * former, but it's not clear that's the best choice.
+ *
+ * An alternative idea is to have each worker determine the cutoff and have
+ * its own vistest. But we need to consider that carefully, since parallel
+ * workers would end up having different cutoffs and horizons.
+ */
+ shared->vistest = *vacrel->vistest;
+
+ shm_toc_insert(pcxt->toc, LV_PARALLEL_SCAN_SHARED, shared);
+
+ phvstate->shared = shared;
+
+ /* prepare the parallel block table scan description */
+ pscan = shm_toc_allocate(pcxt->toc, pscan_len);
+ shm_toc_insert(pcxt->toc, LV_PARALLEL_SCAN_DESC, pscan);
+
+ /* initialize parallel scan description */
+ table_block_parallelscan_initialize(rel, (ParallelTableScanDesc) pscan);
+ phvstate->pscandesc = pscan;
+
+ /* prepare the workers' parallel block table scan state */
+ pscanwork = shm_toc_allocate(pcxt->toc, pscanwork_len);
+ MemSet(pscanwork, 0, pscanwork_len);
+ shm_toc_insert(pcxt->toc, LV_PARALLEL_SCAN_DESC_WORKER, pscanwork);
+ phvstate->scanstates = pscanwork;
+
+ vacrel->phvstate = phvstate;
+}
+
+/*
+ * Main function for parallel heap vacuum workers.
+ */
+void
+heap_parallel_vacuum_scan_worker(Relation rel, ParallelVacuumState *pvs,
+ ParallelWorkerContext *pwcxt)
+{
+ LVRelState vacrel = {0};
+ PHVState *phvstate;
+ PHVShared *shared;
+ ParallelBlockTableScanDesc pscandesc;
+ PHVScanWorkerState *scanstate;
+ LVRelScanStats *scan_stats;
+ ErrorContextCallback errcallback;
+ bool scan_done;
+
+ phvstate = palloc(sizeof(PHVState));
+
+ pscandesc = (ParallelBlockTableScanDesc) shm_toc_lookup(pwcxt->toc,
+ LV_PARALLEL_SCAN_DESC,
+ false);
+ phvstate->pscandesc = pscandesc;
+
+ shared = (PHVShared *) shm_toc_lookup(pwcxt->toc, LV_PARALLEL_SCAN_SHARED,
+ false);
+ phvstate->shared = shared;
+
+ scanstate = (PHVScanWorkerState *) shm_toc_lookup(pwcxt->toc,
+ LV_PARALLEL_SCAN_DESC_WORKER,
+ false);
+
+ phvstate->myscanstate = &(scanstate[ParallelWorkerNumber]);
+ scan_stats = &(shared->worker_scan_stats[ParallelWorkerNumber]);
+
+ /* Prepare LVRelState */
+ vacrel.rel = rel;
+ vacrel.indrels = parallel_vacuum_get_table_indexes(pvs, &vacrel.nindexes);
+ vacrel.pvs = pvs;
+ vacrel.phvstate = phvstate;
+ vacrel.aggressive = shared->aggressive;
+ vacrel.skipwithvm = shared->skipwithvm;
+ vacrel.cutoffs = shared->cutoffs;
+ vacrel.vistest = &(shared->vistest);
+ vacrel.dead_items = parallel_vacuum_get_dead_items(pvs,
+ &vacrel.dead_items_info);
+ vacrel.rel_pages = RelationGetNumberOfBlocks(rel);
+ vacrel.scan_stats = scan_stats;
+
+ /* initialize per-worker relation statistics */
+ MemSet(scan_stats, 0, sizeof(LVRelScanStats));
+
+ /* Set fields necessary for heap scan */
+ vacrel.scan_stats->NewRelfrozenXid = shared->NewRelfrozenXid;
+ vacrel.scan_stats->NewRelminMxid = shared->NewRelminMxid;
+ vacrel.scan_stats->skippedallvis = shared->skippedallvis;
+
+ /* Initialize the per-worker scan state if not yet */
+ if (!phvstate->myscanstate->initialized)
+ {
+ table_block_parallelscan_startblock_init(rel,
+ &(phvstate->myscanstate->state),
+ phvstate->pscandesc);
+
+ pg_atomic_init_u32(&(phvstate->myscanstate->cur_blkno), 0);
+ phvstate->myscanstate->maybe_have_blocks = false;
+ phvstate->myscanstate->initialized = true;
+ }
+
+ /*
+ * Setup error traceback support for ereport() for parallel table vacuum
+ * workers
+ */
+ vacrel.dbname = get_database_name(MyDatabaseId);
+ vacrel.relnamespace = get_namespace_name(RelationGetNamespace(rel));
+ vacrel.relname = pstrdup(RelationGetRelationName(rel));
+ vacrel.indname = NULL;
+ vacrel.phase = VACUUM_ERRCB_PHASE_SCAN_HEAP;
+ errcallback.callback = vacuum_error_callback;
+ errcallback.arg = &vacrel;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ scan_done = do_lazy_scan_heap(&vacrel);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+
+ /*
+ * If the leader or a worker finishes the heap scan because the space for
+ * dead_items TIDs is close to the limit, it might still have some allocated
+ * blocks in its scan state. Since this scan state might not be picked up in
+ * the next heap scan, we remember that it might have some unconsumed blocks
+ * so that the leader can complete the scan after the heap scan phase
+ * finishes.
+ */
+ phvstate->myscanstate->maybe_have_blocks = !scan_done;
+}
+
+/*
+ * Complete parallel heap scans that have remaining blocks in their
+ * chunks.
+ */
+static void
+parallel_heap_complete_unfinished_scan(LVRelState *vacrel)
+{
+ int nworkers;
+
+ Assert(!IsParallelWorker());
+
+ nworkers = parallel_vacuum_get_nworkers_table(vacrel->pvs);
+
+ for (int i = 0; i < nworkers; i++)
+ {
+ PHVScanWorkerState *wstate = &(vacrel->phvstate->scanstates[i]);
+ bool scan_done PG_USED_FOR_ASSERTS_ONLY;
+
+ if (!wstate->maybe_have_blocks)
+ continue;
+
+ /* Attach the worker's scan state and do the heap scan */
+ vacrel->phvstate->myscanstate = wstate;
+ scan_done = do_lazy_scan_heap(vacrel);
+
+ Assert(scan_done);
+ }
+
+ /*
+ * We don't need to gather the scan statistics here because the leader
+ * performed these scans itself, so the counters have been accumulated
+ * into the leader's statistics directly.
+ */
+}
+
+/*
+ * Compute the minimum block number among all workers' current scan
+ * positions and update phvstate->min_blkno.
+ */
+static void
+parallel_heap_vacuum_compute_min_blkno(LVRelState *vacrel)
+{
+ PHVState *phvstate = vacrel->phvstate;
+
+ Assert(ParallelHeapVacuumIsActive(vacrel));
+
+ /*
+ * We check all worker scan states here to compute the minimum block
+ * number among all scan states.
+ */
+ for (int i = 0; i < phvstate->nworkers_launched; i++)
+ {
+ PHVScanWorkerState *wstate = &(phvstate->scanstates[i]);
+ BlockNumber blkno;
+
+ /* Skip if the scan state has not been initialized by any worker */
+ if (!wstate->initialized)
+ continue;
+
+ blkno = pg_atomic_read_u32(&(wstate->cur_blkno));
+ if (blkno < phvstate->min_blkno)
+ phvstate->min_blkno = blkno;
+ }
+}
+
+/*
+ * Accumulate the relation scan statistics that parallel workers collected into the
+ * leader's counters.
+ */
+static void
+parallel_heap_vacuum_gather_scan_stats(LVRelState *vacrel)
+{
+ PHVState *phvstate = vacrel->phvstate;
+
+ Assert(ParallelHeapVacuumIsActive(vacrel));
+ Assert(!IsParallelWorker());
+
+ /* Gather the scan statistics that workers collected */
+ for (int i = 0; i < phvstate->nworkers_launched; i++)
+ {
+ LVRelScanStats *ss = &(phvstate->shared->worker_scan_stats[i]);
+
+ vacrel->scan_stats->scanned_pages += ss->scanned_pages;
+ vacrel->scan_stats->removed_pages += ss->removed_pages;
+ vacrel->scan_stats->frozen_pages += ss->frozen_pages;
+ vacrel->scan_stats->lpdead_item_pages += ss->lpdead_item_pages;
+ vacrel->scan_stats->missed_dead_pages += ss->missed_dead_pages;
+ vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages;
+ vacrel->scan_stats->tuples_deleted += ss->tuples_deleted;
+ vacrel->scan_stats->tuples_frozen += ss->tuples_frozen;
+ vacrel->scan_stats->lpdead_items += ss->lpdead_items;
+ vacrel->scan_stats->live_tuples += ss->live_tuples;
+ vacrel->scan_stats->recently_dead_tuples += ss->recently_dead_tuples;
+ vacrel->scan_stats->missed_dead_tuples += ss->missed_dead_tuples;
+
+ /* Take the maximum, i.e. the last known nonempty page across workers */
+ if (ss->nonempty_pages > vacrel->scan_stats->nonempty_pages)
+ vacrel->scan_stats->nonempty_pages = ss->nonempty_pages;
+
+ if (TransactionIdPrecedes(ss->NewRelfrozenXid, vacrel->scan_stats->NewRelfrozenXid))
+ vacrel->scan_stats->NewRelfrozenXid = ss->NewRelfrozenXid;
+
+ if (MultiXactIdPrecedesOrEquals(ss->NewRelminMxid, vacrel->scan_stats->NewRelminMxid))
+ vacrel->scan_stats->NewRelminMxid = ss->NewRelminMxid;
+
+ if (!vacrel->scan_stats->skippedallvis && ss->skippedallvis)
+ vacrel->scan_stats->skippedallvis = true;
+ }
+
+ /* Also, compute the minimum block number we scanned so far */
+ parallel_heap_vacuum_compute_min_blkno(vacrel);
+}
+
+/*
+ * A parallel variant of do_lazy_scan_heap(). The leader process launches
+ * parallel workers to scan the heap in parallel, and also joins the scan
+ * itself.
+ */
+static void
+do_parallel_lazy_scan_heap(LVRelState *vacrel)
+{
+ PHVScanWorkerState *scanstate;
+
+ Assert(ParallelHeapVacuumIsActive(vacrel));
+ Assert(!IsParallelWorker());
+
+ /* launch parallel workers */
+ vacrel->phvstate->nworkers_launched = parallel_vacuum_table_scan_begin(vacrel->pvs);
+
+ /* initialize the parallel scan state so the leader can join as a worker */
+ scanstate = palloc(sizeof(PHVScanWorkerState));
+ table_block_parallelscan_startblock_init(vacrel->rel, &(scanstate->state),
+ vacrel->phvstate->pscandesc);
+ vacrel->phvstate->myscanstate = scanstate;
+
+ for (;;)
+ {
+ bool scan_done;
+
+ /*
+ * Scan the table until either we are close to overrunning the
+ * available space for dead_items TIDs or we reach the end of the
+ * table.
+ */
+ scan_done = do_lazy_scan_heap(vacrel);
+
+ /* stop parallel workers and gather the collected stats */
+ parallel_vacuum_table_scan_end(vacrel->pvs);
+ parallel_heap_vacuum_gather_scan_stats(vacrel);
+
+ /*
+ * If the heap scan paused in the middle of the table because the space
+ * for dead_items TIDs filled up, perform a round of index and heap
+ * vacuuming.
+ */
+ if (!scan_done)
+ {
+ /* Perform a round of index and heap vacuuming */
+ vacrel->consider_bypass_optimization = false;
+ lazy_vacuum(vacrel);
+
+ /*
+ * Vacuum the Free Space Map to make newly-freed space visible on
+ * upper-level FSM pages.
+ */
+ if (vacrel->phvstate->min_blkno > vacrel->next_fsm_block_to_vacuum)
+ {
+ /*
+ * min_blkno should have already been updated when gathering
+ * statistics
+ */
+ FreeSpaceMapVacuumRange(vacrel->rel, vacrel->next_fsm_block_to_vacuum,
+ vacrel->phvstate->min_blkno + 1);
+ vacrel->next_fsm_block_to_vacuum = vacrel->phvstate->min_blkno;
+ }
+
+ /* Report that we are once again scanning the heap */
+ pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+ PROGRESS_VACUUM_PHASE_SCAN_HEAP);
+
+ /* re-launch parallel workers */
+ vacrel->phvstate->nworkers_launched =
+ parallel_vacuum_table_scan_begin(vacrel->pvs);
+
+ continue;
+ }
+
+ /* We reached the end of the table */
+ break;
+ }
+
+ /*
+ * The parallel heap scan finished, but it's possible that some workers
+ * have allocated blocks that they have not yet processed. This can
+ * happen, for example, when workers exit because the space for dead_items
+ * TIDs filled up and the leader then launches fewer workers in the next
+ * cycle.
+ */
+ parallel_heap_complete_unfinished_scan(vacrel);
+}
+
/*
* Error context callback for errors occurring during vacuum. The error
* context messages for index phases should match the messages set in parallel
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 4fd6574e12..1101e799f9 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -6,15 +6,24 @@
* This file contains routines that are intended to support setting up, using,
* and tearing down a ParallelVacuumState.
*
- * In a parallel vacuum, we perform both index bulk deletion and index cleanup
- * with parallel worker processes. Individual indexes are processed by one
- * vacuum process. ParallelVacuumState contains shared information as well as
- * the memory space for storing dead items allocated in the DSA area. We
- * launch parallel worker processes at the start of parallel index
- * bulk-deletion and index cleanup and once all indexes are processed, the
- * parallel worker processes exit. Each time we process indexes in parallel,
- * the parallel context is re-initialized so that the same DSM can be used for
- * multiple passes of index bulk-deletion and index cleanup.
+ * In a parallel vacuum, we perform the table scan, index bulk-deletion and
+ * index cleanup, or all of them with parallel worker processes. Different
+ * numbers of workers are launched for table vacuuming and index processing.
+ * ParallelVacuumState contains shared information as well as the memory space
+ * for storing dead items allocated in the DSA area.
+ *
+ * When initializing a parallel table vacuum scan, we invoke table AM routines
+ * to estimate DSM sizes and initialize DSM memory. Parallel table vacuum
+ * workers invoke the table AM routine for vacuuming the table.
+ *
+ * For processing indexes in parallel, each index is processed by one vacuum
+ * process. We launch parallel worker processes at the start of parallel index
+ * bulk-deletion and index cleanup, and once all indexes are processed, the
+ * parallel worker processes exit.
+ *
+ * Each time we process the table or indexes in parallel, the parallel context
+ * is re-initialized so that the same DSM can be used for multiple passes of
+ * table vacuum or index bulk-deletion and index cleanup.
*
* Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -28,6 +37,7 @@
#include "access/amapi.h"
#include "access/table.h"
+#include "access/tableam.h"
#include "access/xact.h"
#include "commands/progress.h"
#include "commands/vacuum.h"
@@ -65,6 +75,12 @@ typedef struct PVShared
int elevel;
uint64 queryid;
+ /*
+ * True if the caller wants parallel workers to invoke the table vacuum
+ * scan callback.
+ */
+ bool do_vacuum_table_scan;
+
/*
* Fields for both index vacuum and cleanup.
*
@@ -164,6 +180,9 @@ struct ParallelVacuumState
/* NULL for worker processes */
ParallelContext *pcxt;
+ /* Passed to parallel table scan workers. NULL for leader process */
+ ParallelWorkerContext *pwcxt;
+
/* Parent Heap Relation */
Relation heaprel;
@@ -193,6 +212,16 @@ struct ParallelVacuumState
/* Points to WAL usage area in DSM */
WalUsage *wal_usage;
+ /*
+ * The number of workers for parallel table scan/vacuuming and index
+ * vacuuming, respectively.
+ */
+ int nworkers_for_table;
+ int nworkers_for_index;
+
+ /* The number of times the parallel table vacuum scan has been performed */
+ int num_table_scans;
+
/*
* False if the index is totally unsuitable target for all parallel
* processing. For example, the index could be <
@@ -221,8 +250,9 @@ struct ParallelVacuumState
PVIndVacStatus status;
};
-static int parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested,
- bool *will_parallel_vacuum);
+static void parallel_vacuum_compute_workers(Relation rel, Relation *indrels, int nindexes,
+ int nrequested, int *nworkers_table,
+ int *nworkers_index, bool *will_parallel_vacuum);
static void parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scans,
bool vacuum);
static void parallel_vacuum_process_safe_indexes(ParallelVacuumState *pvs);
@@ -242,7 +272,7 @@ static void parallel_vacuum_error_callback(void *arg);
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
int nrequested_workers, int vac_work_mem,
- int elevel, BufferAccessStrategy bstrategy)
+ int elevel, BufferAccessStrategy bstrategy, void *state)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
@@ -256,6 +286,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
Size est_shared_len;
int nindexes_mwm = 0;
int parallel_workers = 0;
+ int nworkers_table;
+ int nworkers_index;
int querylen;
/*
@@ -263,15 +295,17 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
* relation
*/
Assert(nrequested_workers >= 0);
- Assert(nindexes > 0);
/*
* Compute the number of parallel vacuum workers to launch
*/
will_parallel_vacuum = (bool *) palloc0(sizeof(bool) * nindexes);
- parallel_workers = parallel_vacuum_compute_workers(indrels, nindexes,
- nrequested_workers,
- will_parallel_vacuum);
+ parallel_vacuum_compute_workers(rel, indrels, nindexes, nrequested_workers,
+ &nworkers_table, &nworkers_index,
+ will_parallel_vacuum);
+
+ parallel_workers = Max(nworkers_table, nworkers_index);
+
if (parallel_workers <= 0)
{
/* Can't perform vacuum in parallel -- return NULL */
@@ -285,6 +319,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
pvs->will_parallel_vacuum = will_parallel_vacuum;
pvs->bstrategy = bstrategy;
pvs->heaprel = rel;
+ pvs->nworkers_for_table = nworkers_table;
+ pvs->nworkers_for_index = nworkers_index;
EnterParallelMode();
pcxt = CreateParallelContext("postgres", "parallel_vacuum_main",
@@ -327,6 +363,10 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
else
querylen = 0; /* keep compiler quiet */
+ /* Estimate AM-specific space for parallel table vacuum */
+ if (nworkers_table > 0)
+ table_parallel_vacuum_estimate(rel, pcxt, nworkers_table, state);
+
InitializeParallelDSM(pcxt);
/* Prepare index vacuum stats */
@@ -419,6 +459,10 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
PARALLEL_VACUUM_KEY_QUERY_TEXT, sharedquery);
}
+ /* Prepare AM-specific DSM for parallel table vacuum */
+ if (nworkers_table > 0)
+ table_parallel_vacuum_initialize(rel, pcxt, nworkers_table, state);
+
/* Success -- return parallel vacuum state */
return pvs;
}
@@ -534,33 +578,47 @@ parallel_vacuum_cleanup_all_indexes(ParallelVacuumState *pvs, long num_table_tup
}
/*
- * Compute the number of parallel worker processes to request. Both index
- * vacuum and index cleanup can be executed with parallel workers.
- * The index is eligible for parallel vacuum iff its size is greater than
- * min_parallel_index_scan_size as invoking workers for very small indexes
- * can hurt performance.
+ * Compute the number of parallel worker processes to request for table
+ * vacuum and index vacuum/cleanup.
+ *
+ * For parallel table vacuum, we ask the AM-specific routine to compute the
+ * number of parallel workers. The result is set to *nworkers_table.
*
- * nrequested is the number of parallel workers that user requested. If
- * nrequested is 0, we compute the parallel degree based on nindexes, that is
- * the number of indexes that support parallel vacuum. This function also
- * sets will_parallel_vacuum to remember indexes that participate in parallel
- * vacuum.
+ * For parallel index vacuum, an index is eligible for parallel vacuum iff
+ * its size is greater than min_parallel_index_scan_size, as invoking workers
+ * for very small indexes can hurt performance. nrequested is the number of
+ * parallel workers that user requested. If nrequested is 0, we compute the
+ * parallel degree based on nindexes, that is the number of indexes that
+ * support parallel vacuum. This function also sets will_parallel_vacuum to
+ * remember indexes that participate in parallel vacuum.
*/
-static int
-parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested,
- bool *will_parallel_vacuum)
+static void
+parallel_vacuum_compute_workers(Relation rel, Relation *indrels, int nindexes,
+ int nrequested, int *nworkers_table,
+ int *nworkers_index, bool *will_parallel_vacuum)
{
int nindexes_parallel = 0;
int nindexes_parallel_bulkdel = 0;
int nindexes_parallel_cleanup = 0;
- int parallel_workers;
+ int parallel_workers_table = 0;
+ int parallel_workers_index = 0;
+
+ *nworkers_table = 0;
+ *nworkers_index = 0;
/*
* We don't allow performing parallel operation in standalone backend or
* when parallelism is disabled.
*/
if (!IsUnderPostmaster || max_parallel_maintenance_workers == 0)
- return 0;
+ return;
+
+ /*
+ * Compute the number of workers for parallel table scan. Cap by
+ * max_parallel_maintenance_workers.
+ */
+ parallel_workers_table = Min(table_parallel_vacuum_compute_workers(rel, nrequested),
+ max_parallel_maintenance_workers);
/*
* Compute the number of indexes that can participate in parallel vacuum.
@@ -591,17 +649,18 @@ parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested,
nindexes_parallel--;
/* No index supports parallel vacuum */
- if (nindexes_parallel <= 0)
- return 0;
-
- /* Compute the parallel degree */
- parallel_workers = (nrequested > 0) ?
- Min(nrequested, nindexes_parallel) : nindexes_parallel;
+ if (nindexes_parallel > 0)
+ {
+ /* Compute the parallel degree for parallel index vacuum */
+ parallel_workers_index = (nrequested > 0) ?
+ Min(nrequested, nindexes_parallel) : nindexes_parallel;
- /* Cap by max_parallel_maintenance_workers */
- parallel_workers = Min(parallel_workers, max_parallel_maintenance_workers);
+ /* Cap by max_parallel_maintenance_workers */
+ parallel_workers_index = Min(parallel_workers_index, max_parallel_maintenance_workers);
+ }
- return parallel_workers;
+ *nworkers_table = parallel_workers_table;
+ *nworkers_index = parallel_workers_index;
}
/*
@@ -671,7 +730,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
if (nworkers > 0)
{
/* Reinitialize parallel context to relaunch parallel workers */
- if (num_index_scans > 0)
+ if (num_index_scans > 0 || pvs->num_table_scans > 0)
ReinitializeParallelDSM(pvs->pcxt);
/*
@@ -980,6 +1039,139 @@ parallel_vacuum_index_is_parallel_safe(Relation indrel, int num_index_scans,
return true;
}
+/*
+ * Prepare the DSM and the shared vacuum cost balance, and launch parallel
+ * workers for parallel table vacuum. Return the number of workers launched.
+ *
+ * The caller must call parallel_vacuum_table_scan_end() to finish the parallel
+ * table vacuum.
+ */
+int
+parallel_vacuum_table_scan_begin(ParallelVacuumState *pvs)
+{
+ Assert(!IsParallelWorker());
+
+ if (pvs->nworkers_for_table == 0)
+ return 0;
+
+ pg_atomic_write_u32(&(pvs->shared->cost_balance), VacuumCostBalance);
+ pg_atomic_write_u32(&(pvs->shared->active_nworkers), 0);
+
+ pvs->shared->do_vacuum_table_scan = true;
+
+ if (pvs->num_table_scans > 0)
+ ReinitializeParallelDSM(pvs->pcxt);
+
+ /*
+ * The number of workers might vary between table vacuum and index
+ * processing
+ */
+ ReinitializeParallelWorkers(pvs->pcxt, pvs->nworkers_for_table);
+ LaunchParallelWorkers(pvs->pcxt);
+
+ if (pvs->pcxt->nworkers_launched > 0)
+ {
+ /*
+ * Reset the local cost values for leader backend as we have already
+ * accumulated the remaining balance of heap.
+ */
+ VacuumCostBalance = 0;
+ VacuumCostBalanceLocal = 0;
+
+ /* Enable shared cost balance for leader backend */
+ VacuumSharedCostBalance = &(pvs->shared->cost_balance);
+ VacuumActiveNWorkers = &(pvs->shared->active_nworkers);
+
+ /* Include the worker count for the leader itself */
+ pg_atomic_add_fetch_u32(VacuumActiveNWorkers, 1);
+ }
+
+ ereport(pvs->shared->elevel,
+ (errmsg(ngettext("launched %d parallel vacuum worker for table processing (planned: %d)",
+ "launched %d parallel vacuum workers for table processing (planned: %d)",
+ pvs->pcxt->nworkers_launched),
+ pvs->pcxt->nworkers_launched, pvs->nworkers_for_table)));
+
+ return pvs->pcxt->nworkers_launched;
+}
+
+/*
+ * Wait for all parallel table vacuum scan workers to finish, and accumulate
+ * their buffer usage and WAL usage.
+ */
+void
+parallel_vacuum_table_scan_end(ParallelVacuumState *pvs)
+{
+ Assert(!IsParallelWorker());
+
+ if (pvs->nworkers_for_table == 0)
+ return;
+
+ WaitForParallelWorkersToFinish(pvs->pcxt);
+
+ /* Decrement the worker count for the leader itself */
+ pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1);
+
+ for (int i = 0; i < pvs->pcxt->nworkers_launched; i++)
+ InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]);
+
+ /*
+ * Carry the shared balance value to heap scan and disable shared costing
+ */
+ if (VacuumSharedCostBalance)
+ {
+ VacuumCostBalance = pg_atomic_read_u32(VacuumSharedCostBalance);
+ VacuumSharedCostBalance = NULL;
+ VacuumActiveNWorkers = NULL;
+ }
+
+ pvs->shared->do_vacuum_table_scan = false;
+ pvs->num_table_scans++;
+}
+
+/* Return the array of indexes associated with the given table to be vacuumed */
+Relation *
+parallel_vacuum_get_table_indexes(ParallelVacuumState *pvs, int *nindexes)
+{
+ *nindexes = pvs->nindexes;
+
+ return pvs->indrels;
+}
+
+/* Return the number of workers for parallel table vacuum */
+int
+parallel_vacuum_get_nworkers_table(ParallelVacuumState *pvs)
+{
+ return pvs->nworkers_for_table;
+}
+
+/* Return the number of workers for parallel index processing */
+int
+parallel_vacuum_get_nworkers_index(ParallelVacuumState *pvs)
+{
+ return pvs->nworkers_for_index;
+}
+
+/*
+ * A parallel worker invokes the table-AM-specific vacuum scan callback.
+ */
+static void
+parallel_vacuum_process_table(ParallelVacuumState *pvs)
+{
+ Assert(VacuumActiveNWorkers);
+
+ /* Increment the active worker count before starting the table vacuum */
+ pg_atomic_add_fetch_u32(VacuumActiveNWorkers, 1);
+
+ /* Do table vacuum scan */
+ table_parallel_vacuum_scan(pvs->heaprel, pvs, pvs->pwcxt);
+
+ /*
+ * We have completed the table vacuum so decrement the active worker
+ * count.
+ */
+ pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1);
+}
+
/*
* Perform work within a launched parallel process.
*
@@ -999,7 +1191,6 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
WalUsage *wal_usage;
int nindexes;
char *sharedquery;
- ErrorContextCallback errcallback;
/*
* A parallel vacuum worker must have only PROC_IN_VACUUM flag since we
@@ -1031,7 +1222,6 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
* matched to the leader's one.
*/
vac_open_indexes(rel, RowExclusiveLock, &nindexes, &indrels);
- Assert(nindexes > 0);
if (shared->maintenance_work_mem_worker > 0)
maintenance_work_mem = shared->maintenance_work_mem_worker;
@@ -1062,6 +1252,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
pvs.relname = pstrdup(RelationGetRelationName(rel));
pvs.heaprel = rel;
+ pvs.pwcxt = palloc(sizeof(ParallelWorkerContext));
+ pvs.pwcxt->toc = toc;
+ pvs.pwcxt->seg = seg;
+
/* These fields will be filled during index vacuum or cleanup */
pvs.indname = NULL;
pvs.status = PARALLEL_INDVAC_STATUS_INITIAL;
@@ -1070,17 +1264,29 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
pvs.bstrategy = GetAccessStrategyWithSize(BAS_VACUUM,
shared->ring_nbuffers * (BLCKSZ / 1024));
- /* Setup error traceback support for ereport() */
- errcallback.callback = parallel_vacuum_error_callback;
- errcallback.arg = &pvs;
- errcallback.previous = error_context_stack;
- error_context_stack = &errcallback;
-
/* Prepare to track buffer usage during parallel execution */
InstrStartParallelQuery();
- /* Process indexes to perform vacuum/cleanup */
- parallel_vacuum_process_safe_indexes(&pvs);
+ if (pvs.shared->do_vacuum_table_scan)
+ {
+ parallel_vacuum_process_table(&pvs);
+ }
+ else
+ {
+ ErrorContextCallback errcallback;
+
+ /* Setup error traceback support for ereport() */
+ errcallback.callback = parallel_vacuum_error_callback;
+ errcallback.arg = &pvs;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* Process indexes to perform vacuum/cleanup */
+ parallel_vacuum_process_safe_indexes(&pvs);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+ }
/* Report buffer/WAL usage during parallel execution */
buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
@@ -1090,9 +1296,6 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
TidStoreDetach(dead_items);
- /* Pop the error context stack */
- error_context_stack = errcallback.previous;
-
vac_close_indexes(nindexes, indrels, RowExclusiveLock);
table_close(rel, ShareUpdateExclusiveLock);
FreeAccessStrategy(pvs.bstrategy);
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 36610a1c7e..5b2b08a844 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -164,15 +164,6 @@ typedef struct ProcArrayStruct
*
* The typedef is in the header.
*/
-struct GlobalVisState
-{
- /* XIDs >= are considered running by some backend */
- FullTransactionId definitely_needed;
-
- /* XIDs < are not considered to be running by any backend */
- FullTransactionId maybe_needed;
-};
-
/*
* Result of ComputeXidHorizons().
*/
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b951466ced..e81513c2db 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -21,6 +21,7 @@
#include "access/skey.h"
#include "access/table.h" /* for backward compatibility */
#include "access/tableam.h"
+#include "commands/vacuum.h"
#include "nodes/lockoptions.h"
#include "nodes/primnodes.h"
#include "storage/bufpage.h"
@@ -400,6 +401,13 @@ extern void log_heap_prune_and_freeze(Relation relation, Buffer buffer,
struct VacuumParams;
extern void heap_vacuum_rel(Relation rel,
struct VacuumParams *params, BufferAccessStrategy bstrategy);
+extern int heap_parallel_vacuum_compute_workers(Relation rel, int requested);
+extern void heap_parallel_vacuum_estimate(Relation rel, ParallelContext *pcxt,
+ int nworkers, void *state);
+extern void heap_parallel_vacuum_initialize(Relation rel, ParallelContext *pcxt,
+ int nworkers, void *state);
+extern void heap_parallel_vacuum_scan_worker(Relation rel, ParallelVacuumState *pvs,
+ ParallelWorkerContext *pwcxt);
/* in heap/heapam_visibility.c */
extern bool HeapTupleSatisfiesVisibility(HeapTuple htup, Snapshot snapshot,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index da661289c1..fc48f74828 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -20,6 +20,7 @@
#include "access/relscan.h"
#include "access/sdir.h"
#include "access/xact.h"
+#include "commands/vacuum.h"
#include "executor/tuptable.h"
#include "storage/read_stream.h"
#include "utils/rel.h"
@@ -655,6 +656,46 @@ typedef struct TableAmRoutine
struct VacuumParams *params,
BufferAccessStrategy bstrategy);
+ /* ------------------------------------------------------------------------
+ * Callbacks for parallel table vacuum.
+ * ------------------------------------------------------------------------
+ */
+
+ /*
+ * Compute the number of parallel workers for parallel table vacuum. The
+ * function must return 0 to disable parallel table vacuum.
+ */
+ int (*parallel_vacuum_compute_workers) (Relation rel, int requested);
+
+ /*
+ * Compute the amount of DSM space the AM needs for parallel table vacuum.
+ *
+ * Not called if parallel table vacuum is disabled.
+ */
+ void (*parallel_vacuum_estimate) (Relation rel,
+ ParallelContext *pcxt,
+ int nworkers,
+ void *state);
+
+ /*
+ * Initialize DSM space for parallel table vacuum.
+ *
+ * Not called if parallel table vacuum is disabled.
+ */
+ void (*parallel_vacuum_initialize) (Relation rel,
+ ParallelContext *pctx,
+ int nworkers,
+ void *state);
+
+ /*
+ * This callback is called for parallel table vacuum workers.
+ *
+ * Not called if parallel table vacuum is disabled.
+ */
+ void (*parallel_vacuum_scan_worker) (Relation rel,
+ ParallelVacuumState *pvs,
+ ParallelWorkerContext *pwcxt);
+
/*
* Prepare to analyze block `blockno` of `scan`. The scan has been started
* with table_beginscan_analyze(). See also
@@ -1710,6 +1751,52 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
rel->rd_tableam->relation_vacuum(rel, params, bstrategy);
}
+/* ----------------------------------------------------------------------------
+ * Parallel vacuum related functions.
+ * ----------------------------------------------------------------------------
+ */
+
+/*
+ * Return the number of parallel workers for a parallel vacuum scan of this
+ * relation.
+ */
+static inline int
+table_parallel_vacuum_compute_workers(Relation rel, int requested)
+{
+ return rel->rd_tableam->parallel_vacuum_compute_workers(rel, requested);
+}
+
+/*
+ * Estimate the size of shared memory needed for a parallel vacuum scan of
+ * this relation.
+ */
+static inline void
+table_parallel_vacuum_estimate(Relation rel, ParallelContext *pcxt, int nworkers,
+ void *state)
+{
+ rel->rd_tableam->parallel_vacuum_estimate(rel, pcxt, nworkers, state);
+}
+
+/*
+ * Initialize shared memory area for a parallel vacuum scan of this relation.
+ */
+static inline void
+table_parallel_vacuum_initialize(Relation rel, ParallelContext *pcxt, int nworkers,
+ void *state)
+{
+ rel->rd_tableam->parallel_vacuum_initialize(rel, pcxt, nworkers, state);
+}
+
+/*
+ * Start a parallel vacuum scan of this relation.
+ */
+static inline void
+table_parallel_vacuum_scan(Relation rel, ParallelVacuumState *pvs,
+ ParallelWorkerContext *pwcxt)
+{
+ rel->rd_tableam->parallel_vacuum_scan_worker(rel, pvs, pwcxt);
+}
+
/*
* Prepare to analyze the next block in the read stream. The scan needs to
* have been started with table_beginscan_analyze(). Note that this routine
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 759f9a87d3..a225f31429 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -360,7 +360,8 @@ extern void VacuumUpdateCosts(void);
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
int vac_work_mem, int elevel,
- BufferAccessStrategy bstrategy);
+ BufferAccessStrategy bstrategy,
+ void *state);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs,
VacDeadItemsInfo **dead_items_info_p);
@@ -372,6 +373,11 @@ extern void parallel_vacuum_cleanup_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans,
bool estimated_count);
+extern int parallel_vacuum_table_scan_begin(ParallelVacuumState *pvs);
+extern void parallel_vacuum_table_scan_end(ParallelVacuumState *pvs);
+extern int parallel_vacuum_get_nworkers_table(ParallelVacuumState *pvs);
+extern int parallel_vacuum_get_nworkers_index(ParallelVacuumState *pvs);
+extern Relation *parallel_vacuum_get_table_indexes(ParallelVacuumState *pvs, int *nindexes);
extern void parallel_vacuum_main(dsm_segment *seg, shm_toc *toc);
/* in commands/analyze.c */
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 9398a84051..6ccb19a29f 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -102,8 +102,20 @@ extern char *ExportSnapshot(Snapshot snapshot);
/*
* These live in procarray.c because they're intimately linked to the
* procarray contents, but thematically they better fit into snapmgr.h.
+ *
+ * XXX the struct definition is temporarily moved from procarray.c for
+ * parallel table vacuum development. We need to find a suitable way for
+ * parallel table vacuum workers to share the GlobalVisState.
*/
-typedef struct GlobalVisState GlobalVisState;
+typedef struct GlobalVisState
+{
+ /* XIDs >= are considered running by some backend */
+ FullTransactionId definitely_needed;
+
+ /* XIDs < are not considered to be running by any backend */
+ FullTransactionId maybe_needed;
+} GlobalVisState;
+
extern GlobalVisState *GlobalVisTestFor(Relation rel);
extern bool GlobalVisTestIsRemovableXid(GlobalVisState *state, TransactionId xid);
extern bool GlobalVisTestIsRemovableFullXid(GlobalVisState *state, FullTransactionId fxid);
--
2.43.5
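(For context: a minimal sketch, not part of the patch above, of how a table
AM would register the new parallel vacuum callbacks. The heap AM's actual
registration hunk is not shown in this excerpt, so the placement in heapam's
TableAmRoutine is an assumption; the callback and function names come from
the tableam.h and heapam.h hunks.)

    /* Hypothetical registration of the new parallel table vacuum callbacks */
    static const TableAmRoutine heapam_methods = {
        .type = T_TableAmRoutine,

        /* ... existing callbacks elided ... */

        /* returns 0 to disable parallel table vacuum */
        .parallel_vacuum_compute_workers = heap_parallel_vacuum_compute_workers,
        /* estimate and initialize the AM-specific DSM area */
        .parallel_vacuum_estimate = heap_parallel_vacuum_estimate,
        .parallel_vacuum_initialize = heap_parallel_vacuum_initialize,
        /* per-worker entry point for the parallel heap scan */
        .parallel_vacuum_scan_worker = heap_parallel_vacuum_scan_worker,
    };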
v3-0002-raidxtree.h-support-shared-iteration.patch
From b8254de5f092f9b51c0a2537858813c59adc560f Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 24 Oct 2024 17:29:51 -0700
Subject: [PATCH v3 2/4] radixtree.h: support shared iteration.
This commit supports a shared iteration operation on a radix tree with
multiple processes. The radix tree must be in shared mode to start a
shared iteration. Parallel workers can attach to the shared iteration
using the iterator handle given by the leader process. As with normal
iteration, the shared iteration is guaranteed to return key-values in
ascending order.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
---
src/include/lib/radixtree.h | 221 +++++++++++++++++++++++++++++++-----
1 file changed, 190 insertions(+), 31 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 88bf695e3f..b93553200d 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -177,6 +177,9 @@
#define RT_ATTACH RT_MAKE_NAME(attach)
#define RT_DETACH RT_MAKE_NAME(detach)
#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#define RT_BEGIN_ITERATE_SHARED RT_MAKE_NAME(begin_iterate_shared)
+#define RT_ATTACH_ITERATE_SHARED RT_MAKE_NAME(attach_iterate_shared)
+#define RT_GET_ITER_HANDLE RT_MAKE_NAME(get_iter_handle)
#define RT_LOCK_EXCLUSIVE RT_MAKE_NAME(lock_exclusive)
#define RT_LOCK_SHARE RT_MAKE_NAME(lock_share)
#define RT_UNLOCK RT_MAKE_NAME(unlock)
@@ -236,15 +239,19 @@
#define RT_SHRINK_NODE_16 RT_MAKE_NAME(shrink_child_16)
#define RT_SHRINK_NODE_48 RT_MAKE_NAME(shrink_child_48)
#define RT_SHRINK_NODE_256 RT_MAKE_NAME(shrink_child_256)
+#define RT_INITIALIZE_ITER RT_MAKE_NAME(initialize_iter)
#define RT_NODE_ITERATE_NEXT RT_MAKE_NAME(node_iterate_next)
#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
/* type declarations */
#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER_CONTROL RT_MAKE_NAME(iter_control)
#define RT_ITER RT_MAKE_NAME(iter)
#ifdef RT_SHMEM
#define RT_HANDLE RT_MAKE_NAME(handle)
+#define RT_ITER_CONTROL_SHARED RT_MAKE_NAME(iter_control_shared)
+#define RT_ITER_HANDLE RT_MAKE_NAME(iter_handle)
#endif
#define RT_NODE RT_MAKE_NAME(node)
#define RT_CHILD_PTR RT_MAKE_NAME(child_ptr)
@@ -270,6 +277,7 @@ typedef struct RT_ITER RT_ITER;
#ifdef RT_SHMEM
typedef dsa_pointer RT_HANDLE;
+typedef dsa_pointer RT_ITER_HANDLE;
#endif
#ifdef RT_SHMEM
@@ -687,6 +695,7 @@ typedef struct RT_RADIX_TREE_CONTROL
RT_HANDLE handle;
uint32 magic;
LWLock lock;
+ int tranche_id;
#endif
RT_PTR_ALLOC root;
@@ -740,11 +749,9 @@ typedef struct RT_NODE_ITER
int idx;
} RT_NODE_ITER;
-/* state for iterating over the whole radix tree */
-struct RT_ITER
+/* Contains the iteration state data */
+typedef struct RT_ITER_CONTROL
{
- RT_RADIX_TREE *tree;
-
/*
* A stack to track iteration for each level. Level 0 is the lowest (or
* leaf) level
@@ -755,8 +762,36 @@ struct RT_ITER
/* The key constructed during iteration */
uint64 key;
-};
+} RT_ITER_CONTROL;
+
+#ifdef RT_SHMEM
+/* Contains the shared iteration state data */
+typedef struct RT_ITER_CONTROL_SHARED
+{
+ /* Actual shared iteration state data */
+ RT_ITER_CONTROL common;
+
+ /* protect the control data */
+ LWLock lock;
+
+ RT_ITER_HANDLE handle;
+ pg_atomic_uint32 refcnt;
+} RT_ITER_CONTROL_SHARED;
+#endif
+
+/* state for iterating over the whole radix tree */
+struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+ /* pointing to either local memory or DSA */
+ RT_ITER_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ /* True if the iterator is for shared iteration */
+ bool shared;
+#endif
+};
/* verification (available only in assert-enabled builds) */
static void RT_VERIFY_NODE(RT_NODE * node);
@@ -1848,6 +1883,7 @@ RT_CREATE(MemoryContext ctx)
tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
tree->ctl->handle = dp;
tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+ tree->ctl->tranche_id = tranche_id;
LWLockInitialize(&tree->ctl->lock, tranche_id);
#else
tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
@@ -1900,6 +1936,9 @@ RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
dsa_pointer control;
tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->iter_context = AllocSetContextCreate(CurrentMemoryContext,
+ RT_STR(RT_PREFIX) "_radix_tree iter context",
+ ALLOCSET_SMALL_SIZES);
/* Find the control object in shared memory */
control = handle;
@@ -2072,35 +2111,86 @@ RT_FREE(RT_RADIX_TREE * tree)
/***************** ITERATION *****************/
+/* Common routine to initialize the given iterator */
+static void
+RT_INITIALIZE_ITER(RT_RADIX_TREE * tree, RT_ITER * iter)
+{
+ RT_CHILD_PTR root;
+
+ iter->tree = tree;
+
+ Assert(RT_PTR_ALLOC_IS_VALID(tree->ctl->root));
+ root.alloc = iter->tree->ctl->root;
+ RT_PTR_SET_LOCAL(tree, &root);
+
+ iter->ctl->top_level = iter->tree->ctl->start_shift / RT_SPAN;
+
+ /* Set the root to start */
+ iter->ctl->cur_level = iter->ctl->top_level;
+ iter->ctl->node_iters[iter->ctl->cur_level].node = root;
+ iter->ctl->node_iters[iter->ctl->cur_level].idx = 0;
+}
+
/*
* Create and return the iterator for the given radix tree.
*
- * Taking a lock in shared mode during the iteration is the caller's
- * responsibility.
+ * Taking a lock on a radix tree in shared mode during the iteration is the
+ * caller's responsibility.
*/
RT_SCOPE RT_ITER *
RT_BEGIN_ITERATE(RT_RADIX_TREE * tree)
{
RT_ITER *iter;
- RT_CHILD_PTR root;
iter = (RT_ITER *) MemoryContextAllocZero(tree->iter_context,
sizeof(RT_ITER));
- iter->tree = tree;
+ iter->ctl = (RT_ITER_CONTROL *) MemoryContextAllocZero(tree->iter_context,
+ sizeof(RT_ITER_CONTROL));
- Assert(RT_PTR_ALLOC_IS_VALID(tree->ctl->root));
- root.alloc = iter->tree->ctl->root;
- RT_PTR_SET_LOCAL(tree, &root);
+ RT_INITIALIZE_ITER(tree, iter);
- iter->top_level = iter->tree->ctl->start_shift / RT_SPAN;
+#ifdef RT_SHMEM
+ /* we will do a non-shared iteration on a shared radix tree */
+ iter->shared = false;
+#endif
- /* Set the root to start */
- iter->cur_level = iter->top_level;
- iter->node_iters[iter->cur_level].node = root;
- iter->node_iters[iter->cur_level].idx = 0;
+ return iter;
+}
+
+#ifdef RT_SHMEM
+/*
+ * Create and return a shared iterator for the given shared radix tree.
+ *
+ * Taking a lock on a radix tree in shared mode during the shared iteration to
+ * prevent concurrent writes is the caller's responsibility.
+ */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE_SHARED(RT_RADIX_TREE * tree)
+{
+ RT_ITER *iter;
+ RT_ITER_CONTROL_SHARED *ctl_shared;
+ dsa_pointer dp;
+
+ /* The radix tree must be in shared mode */
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ dp = dsa_allocate(tree->dsa, sizeof(RT_ITER_CONTROL_SHARED));
+ ctl_shared = (RT_ITER_CONTROL_SHARED *) dsa_get_address(tree->dsa, dp);
+ ctl_shared->handle = dp;
+ LWLockInitialize(&ctl_shared->lock, tree->ctl->tranche_id);
+ pg_atomic_init_u32(&ctl_shared->refcnt, 1);
+
+ iter = (RT_ITER *) MemoryContextAllocZero(tree->iter_context,
+ sizeof(RT_ITER));
+
+ iter->ctl = (RT_ITER_CONTROL *) ctl_shared;
+ iter->shared = true;
+
+ RT_INITIALIZE_ITER(tree, iter);
return iter;
}
+#endif
/*
* Scan the inner node and return the next child pointer if one exists, otherwise
@@ -2114,12 +2204,18 @@ RT_NODE_ITERATE_NEXT(RT_ITER * iter, int level)
RT_CHILD_PTR node;
RT_PTR_ALLOC *slot = NULL;
+ node_iter = &(iter->ctl->node_iters[level]);
+ node = node_iter->node;
+
#ifdef RT_SHMEM
- Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
-#endif
- node_iter = &(iter->node_iters[level]);
- node = node_iter->node;
+ /*
+ * Since the iterator is shared, the node's local pointer might have been
+ * set by another backend, so we need to make sure to use our own local
+ * pointer.
+ */
+ if (iter->shared)
+ RT_PTR_SET_LOCAL(iter->tree, &node);
+#endif
Assert(node.local != NULL);
@@ -2192,8 +2288,8 @@ RT_NODE_ITERATE_NEXT(RT_ITER * iter, int level)
}
/* Update the key */
- iter->key &= ~(((uint64) RT_CHUNK_MASK) << (level * RT_SPAN));
- iter->key |= (((uint64) key_chunk) << (level * RT_SPAN));
+ iter->ctl->key &= ~(((uint64) RT_CHUNK_MASK) << (level * RT_SPAN));
+ iter->ctl->key |= (((uint64) key_chunk) << (level * RT_SPAN));
return slot;
}
@@ -2207,18 +2303,29 @@ RT_ITERATE_NEXT(RT_ITER * iter, uint64 *key_p)
{
RT_PTR_ALLOC *slot = NULL;
- while (iter->cur_level <= iter->top_level)
+#ifdef RT_SHMEM
+ /* Prevent the shared iterator from being updated concurrently */
+ if (iter->shared)
+ LWLockAcquire(&((RT_ITER_CONTROL_SHARED *) iter->ctl)->lock, LW_EXCLUSIVE);
+#endif
+
+ while (iter->ctl->cur_level <= iter->ctl->top_level)
{
RT_CHILD_PTR node;
- slot = RT_NODE_ITERATE_NEXT(iter, iter->cur_level);
+ slot = RT_NODE_ITERATE_NEXT(iter, iter->ctl->cur_level);
- if (iter->cur_level == 0 && slot != NULL)
+ if (iter->ctl->cur_level == 0 && slot != NULL)
{
/* Found a value at the leaf node */
- *key_p = iter->key;
+ *key_p = iter->ctl->key;
node.alloc = *slot;
+#ifdef RT_SHMEM
+ if (iter->shared)
+ LWLockRelease(&((RT_ITER_CONTROL_SHARED *) iter->ctl)->lock);
+#endif
+
if (RT_CHILDPTR_IS_VALUE(*slot))
return (RT_VALUE_TYPE *) slot;
else
@@ -2234,17 +2341,23 @@ RT_ITERATE_NEXT(RT_ITER * iter, uint64 *key_p)
node.alloc = *slot;
RT_PTR_SET_LOCAL(iter->tree, &node);
- iter->cur_level--;
- iter->node_iters[iter->cur_level].node = node;
- iter->node_iters[iter->cur_level].idx = 0;
+ iter->ctl->cur_level--;
+ iter->ctl->node_iters[iter->ctl->cur_level].node = node;
+ iter->ctl->node_iters[iter->ctl->cur_level].idx = 0;
}
else
{
/* Not found the child slot, move up the tree */
- iter->cur_level++;
+ iter->ctl->cur_level++;
}
}
+#ifdef RT_SHMEM
+ if (iter->shared)
+ LWLockRelease(&((RT_ITER_CONTROL_SHARED *) iter->ctl)->lock);
+#endif
+
/* We've visited all nodes, so the iteration finished */
return NULL;
}
@@ -2255,9 +2368,45 @@ RT_ITERATE_NEXT(RT_ITER * iter, uint64 *key_p)
RT_SCOPE void
RT_END_ITERATE(RT_ITER * iter)
{
+#ifdef RT_SHMEM
+ RT_ITER_CONTROL_SHARED *ctl = (RT_ITER_CONTROL_SHARED *) iter->ctl;
+
+ if (iter->shared &&
+ pg_atomic_sub_fetch_u32(&ctl->refcnt, 1) == 0)
+ dsa_free(iter->tree->dsa, ctl->handle);
+#endif
pfree(iter);
}
+#ifdef RT_SHMEM
+RT_SCOPE RT_ITER_HANDLE
+RT_GET_ITER_HANDLE(RT_ITER * iter)
+{
+ Assert(iter->shared);
+ return ((RT_ITER_CONTROL_SHARED *) iter->ctl)->handle;
+}
+
+RT_SCOPE RT_ITER *
+RT_ATTACH_ITERATE_SHARED(RT_RADIX_TREE * tree, RT_ITER_HANDLE handle)
+{
+ RT_ITER *iter;
+ RT_ITER_CONTROL_SHARED *ctl;
+
+ iter = (RT_ITER *) MemoryContextAllocZero(tree->iter_context,
+ sizeof(RT_ITER));
+ iter->tree = tree;
+ ctl = (RT_ITER_CONTROL_SHARED *) dsa_get_address(tree->dsa, handle);
+ iter->ctl = (RT_ITER_CONTROL *) ctl;
+ iter->shared = true;
+
+ /* For every iterator, increase the refcnt by 1 */
+ pg_atomic_add_fetch_u32(&ctl->refcnt, 1);
+
+ return iter;
+}
+#endif
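
(A brief usage sketch, editorial and not part of the patch: one backend creates the shared iterator and exports its handle, other backends attach via that handle, and every participant just calls RT_ITERATE_NEXT, which serializes position updates through the iterator's LWLock. The RT_* names below stand for the prefix-expanded functions; process_entry() is a hypothetical callback.)

/* Backend A: create the shared iterator and export its handle. */
RT_ITER    *iter = RT_BEGIN_ITERATE_SHARED(tree);
RT_ITER_HANDLE handle = RT_GET_ITER_HANDLE(iter);   /* passed via DSM */

/* Backend B: attach to the same iteration state. */
RT_ITER    *iter_b = RT_ATTACH_ITERATE_SHARED(tree, handle);

/* Any participant: pull entries until the shared iteration is exhausted. */
uint64      key;
RT_VALUE_TYPE *value;

while ((value = RT_ITERATE_NEXT(iter_b, &key)) != NULL)
    process_entry(key, value);      /* hypothetical per-entry work */

/* Each participant detaches; the last RT_END_ITERATE frees the DSA memory. */
RT_END_ITERATE(iter_b);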
+
/***************** DELETION *****************/
#ifdef RT_USE_DELETE
@@ -2957,7 +3106,11 @@ RT_DUMP_NODE(RT_NODE * node)
#undef RT_PTR_ALLOC
#undef RT_INVALID_PTR_ALLOC
#undef RT_HANDLE
+#undef RT_ITER_HANDLE
+#undef RT_ITER_CONTROL
#undef RT_ITER
+#undef RT_SHARED_ITER
#undef RT_NODE
#undef RT_NODE_ITER
#undef RT_NODE_KIND_4
@@ -2994,6 +3147,11 @@ RT_DUMP_NODE(RT_NODE * node)
#undef RT_LOCK_SHARE
#undef RT_UNLOCK
#undef RT_GET_HANDLE
+#undef RT_BEGIN_ITERATE_SHARED
+#undef RT_ATTACH_ITERATE_SHARED
+#undef RT_GET_ITER_HANDLE
+#undef RT_ATTACH_ITER
#undef RT_FIND
#undef RT_SET
#undef RT_BEGIN_ITERATE
@@ -3050,5 +3208,6 @@ RT_DUMP_NODE(RT_NODE * node)
#undef RT_SHRINK_NODE_256
#undef RT_NODE_DELETE
#undef RT_NODE_INSERT
+#undef RT_INITIALIZE_ITER
#undef RT_NODE_ITERATE_NEXT
#undef RT_VERIFY_NODE
--
2.43.5
On Fri, Oct 25, 2024 at 12:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Oct 22, 2024 at 4:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Sorry for the very late reply.
On Tue, Jul 30, 2024 at 8:54 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Sawada-san,
Thank you for testing!
I tried to profile the vacuuming with the larger case (40 workers for the 20G
table) and the attached FlameGraph shows the result. IIUC, I cannot find
bottlenecks.

2.
I compared parallel heap scan and found that it does not have a
compute_worker API. Can you clarify the reason why there is an inconsistency?
(I feel it is intentional because the calculation logic seems to depend on the
heap structure, so should we add the API for table scan as well?)
There is room to consider a better API design, but yes, the reason is
that the calculation logic depends on the table AM implementation. For
example, I thought it might make sense to take the number of all-visible
pages into account when calculating the number of parallel workers, as we
don't want to launch many workers on a table where most pages are
all-visible. That might not work for other table AMs.

Okay, thanks for confirming. I wanted to ask others as well.
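
To illustrate the kind of AM-specific heuristic being discussed, here is a rough sketch; the function name and threshold are hypothetical, and only the idea of discounting all-visible pages comes from the discussion above:

/*
 * Sketch: scale the worker count by the number of pages that are NOT
 * all-visible, since a mostly all-visible table leaves little to scan.
 * The 8192-block step is a made-up threshold for illustration.
 */
static int
heap_compute_vacuum_workers_sketch(BlockNumber rel_pages,
                                   BlockNumber all_visible_pages,
                                   int max_workers)
{
    BlockNumber pages_to_scan = rel_pages - Min(rel_pages, all_visible_pages);
    int         workers = 0;

    while (pages_to_scan >= 8192 && workers < max_workers)
    {
        workers++;
        pages_to_scan /= 2;
    }

    return workers;
}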
I'm updating the patch to implement parallel heap vacuum and will
share the updated patch. It might take time as it requires implementing
shared iteration support in the radix tree.

Here are other preliminary comments for the v2 patch. This does not contain
cosmetic ones.

1.
The shared data structure PHVShared does not contain a mutex lock. Is it
intentional because the fields are accessed by the leader only after parallel
workers exit?

Yes, the fields in PHVShared are read-only for workers. Since no
concurrent reads/writes happen on these fields, we don't need to
protect them.
2.
Per my understanding, the vacuuming goes through the steps below:

a. parallel workers are launched for scanning pages
b. leader waits until scans are done
c. leader does vacuum alone (you may extend here...)
d. parallel workers are launched again to clean up indexes

If so, can we reuse the parallel workers for the cleanup? Or is that more
painful engineering than the benefit warrants?

I've not thought of this idea, but I think it's possible from a
technical perspective. It saves the overhead of relaunching workers, but
I'm not sure how much it would help performance, and I'm concerned it
would make the code complex. For example, different numbers of workers
might be required for table vacuuming and index vacuuming, so we would
end up increasing or decreasing the number of workers.

3.
According to LaunchParallelWorkers(), the bgw_name and bgw_type are
hardcoded as "parallel worker ...". Can we extend this to improve
trackability in pg_stat_activity?

It would be a good improvement for better trackability, but I think we
should do that in a separate patch as it's not just a problem for
parallel heap vacuum.
4.
I'm not an expert on TidStore, but as you said, TidStoreLockExclusive() might
be a bottleneck when TIDs are added to the shared TidStore. Another primitive
idea is to prepare per-worker TidStores (in the PHVScanWorkerState or
LVRelCounters?) and gather them after the heap scanning. If you extend the
patch so that parallel workers do the vacuuming as well, the gathering may
not be needed: each worker can access its own TidStore and clean up. One
downside is that the memory consumption may be quite large.

Interesting idea. Supposing we supported parallel heap vacuum as well, we
wouldn't need locks or shared-iteration support on TidStore. I think each
worker should use a fraction of maintenance_work_mem. However, one downside
would be that we would need to check as many TidStores as there are workers
during index vacuuming.

On further thought, I don't think this idea goes well. Index vacuuming is
the most time-consuming phase among the vacuum phases, so it would not be
a good idea to make it slower even if we could do parallel heap scan and
heap vacuum without any locking. Also, merging multiple TidStores into one
is not straightforward since the block ranges that each worker processes
overlap.

FYI I've implemented the parallel heap vacuum part and am doing some
benchmark tests. I'll share the updated patches along with test
results this week.

Please find the attached patches. From the previous version, I made a
lot of changes including bug fixes, addressing review comments, and
adding parallel heap vacuum support. Parallel vacuum related
infrastructure is implemented in vacuumparallel.c, and lazyvacuum.c
now uses ParallelVacuumState for parallel heap scan/vacuum, index
bulkdelete/cleanup, or both. Parallel vacuum workers launch at the
beginning of each phase and exit at the end of each phase. Since
different numbers of workers could be used for heap scan/vacuum and
index bulkdelete/cleanup, it's possible that only one of heap
scan/vacuum and index bulkdelete/cleanup is parallelized.

In order to implement parallel heap vacuum, I extended the radix tree
and tidstore to support shared iteration. The shared iteration works
only with a shared tidstore, but a non-shared iteration works with a
local tidstore as well as a shared tidstore. For example, if a table is
large and has one index, we use only parallel heap scan/vacuum. In
this case, we store dead item TIDs into a shared tidstore during the
parallel heap scan, but during index bulk-deletion we perform a
non-shared iteration on the shared tidstore, which is more efficient
as it doesn't acquire any locks during the iteration.

I've done benchmark tests with a 10GB unlogged table (created on a
tmpfs tablespace) having 4 btree indexes while changing the parallel
degree. I restarted the postgres server before each run to ensure that
the data is not in shared memory, and I avoided disk I/O during lazy
vacuum as much as possible. Here is a comparison between HEAD and
patched (took the median of 5 runs):

+----------+-----------+-----------+-------------+
| parallel | HEAD      | patched   | improvement |
+----------+-----------+-----------+-------------+
| 0        | 53079.53  | 53468.734 | 1.007       |
| 1        | 48101.46  | 35712.613 | 0.742       |
| 2        | 37767.902 | 23566.426 | 0.624       |
| 4        | 38005.836 | 20192.055 | 0.531       |
| 8        | 37754.47  | 18614.717 | 0.493       |
+----------+-----------+-----------+-------------+

Here are the breakdowns of the execution times of each vacuum phase
(from left: heap scan, index bulkdel, heap vacuum):

- HEAD
parallel 0: 53079.530 (15886, 28039, 9270)
parallel 1: 48101.460 (15931, 23247, 9215)
parallel 2: 37767.902 (15259, 12888, 9479)
parallel 4: 38005.836 (16097, 12683, 9217)
parallel 8: 37754.470 (16016, 12535, 9306)

- Patched
parallel 0: 53468.734 (15990, 28296, 9465)
parallel 1: 35712.613 ( 8254, 23569, 3700)
parallel 2: 23566.426 ( 6180, 12760, 3283)
parallel 4: 20192.055 ( 4058, 12776, 2154)
parallel 8: 18614.717 ( 2797, 13244, 1579)

The index bulkdel phase is saturated at parallel 2, as one worker is
assigned to each index. On HEAD, there is no further performance gain
with more than 'parallel 4'. On the other hand, on patched, it got
faster even at 'parallel 4' and 'parallel 8' since the other two phases
were also done with parallel workers.
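
To make the shared-iteration flow concrete, here is a minimal sketch of how the leader and a worker might drive the TidStore APIs added by these patches (TidStoreBeginIterateShared, TidStoreGetSharedIterHandle, TidStoreAttachIterateShared); launch_workers(), wait_for_workers(), and vacuum_one_page() are hypothetical placeholders, and 'shared' lives in the DSM segment. As at the radix tree layer, whichever backend detaches last frees the shared iterator state.

static void
leader_heap_vacuum_pass(TidStore *dead_items, PHVShared *shared)
{
    TidStoreIter *iter = TidStoreBeginIterateShared(dead_items);
    TidStoreIterResult *res;

    /* Publish the iterator so workers can attach to it. */
    shared->shared_iter_handle = TidStoreGetSharedIterHandle(iter);
    launch_workers();

    /* The leader consumes pages alongside the workers. */
    while ((res = TidStoreIterateNext(iter)) != NULL)
        vacuum_one_page(res);

    wait_for_workers();
    TidStoreEndIterate(iter);   /* the last detach frees the shared state */
}

static void
worker_heap_vacuum_pass(TidStore *dead_items, PHVShared *shared)
{
    TidStoreIter *iter = TidStoreAttachIterateShared(dead_items,
                                                     shared->shared_iter_handle);
    TidStoreIterResult *res;

    while ((res = TidStoreIterateNext(iter)) != NULL)
        vacuum_one_page(res);

    TidStoreEndIterate(iter);
}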
I've attached new version patches that fix the failures reported by
cfbot. I hope these changes make cfbot happy.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v4-0004-Support-parallel-heap-vacuum-during-lazy-vacuum.patch
From 92cd53dff4e9a3da1278e7b666c15c03132c434d Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 24 Oct 2024 17:37:45 -0700
Subject: [PATCH v4 4/4] Support parallel heap vacuum during lazy vacuum.
This commit further extends parallel vacuum to perform the heap vacuum
phase with parallel workers. It leverages the shared TidStore iteration.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
src/backend/access/heap/vacuumlazy.c | 175 +++++++++++++++++++--------
1 file changed, 122 insertions(+), 53 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 10991666e0b..1ab34732833 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -158,6 +158,7 @@ typedef struct LVRelScanStats
BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
+ BlockNumber vacuumed_pages; /* # pages vacuumed in one second-pass cycle */
/* Counters that follow are only for scanned_pages */
int64 tuples_deleted; /* # deleted from table */
@@ -186,11 +187,15 @@ typedef struct PHVShared
MultiXactId NewRelminMxid;
bool skippedallvis;
+ bool do_index_vacuuming;
/* VACUUM operation's cutoffs for freezing and pruning */
struct VacuumCutoffs cutoffs;
GlobalVisState vistest;
+ dsa_pointer shared_iter_handle;
+ bool do_heap_vacuum;
+
/* per-worker scan stats for parallel heap vacuum scan */
LVRelScanStats worker_scan_stats[FLEXIBLE_ARRAY_MEMBER];
} PHVShared;
@@ -352,6 +357,7 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
+static void do_lazy_vacuum_heap_rel(LVRelState *vacrel, TidStoreIter *iter);
static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
Buffer buffer, OffsetNumber *deadoffsets,
int num_offsets, Buffer vmbuffer);
@@ -530,6 +536,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
scan_stats->lpdead_item_pages = 0;
scan_stats->missed_dead_pages = 0;
scan_stats->nonempty_pages = 0;
+ scan_stats->vacuumed_pages = 0;
/* Initialize remaining counters (be tidy) */
scan_stats->tuples_deleted = 0;
@@ -2362,46 +2369,14 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
return allindexes;
}
-/*
- * lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy
- *
- * This routine marks LP_DEAD items in vacrel->dead_items as LP_UNUSED. Pages
- * that never had lazy_scan_prune record LP_DEAD items are not visited at all.
- *
- * We may also be able to truncate the line pointer array of the heap pages we
- * visit. If there is a contiguous group of LP_UNUSED items at the end of the
- * array, it can be reclaimed as free space. These LP_UNUSED items usually
- * start out as LP_DEAD items recorded by lazy_scan_prune (we set items from
- * each page to LP_UNUSED, and then consider if it's possible to truncate the
- * page's line pointer array).
- *
- * Note: the reason for doing this as a second pass is we cannot remove the
- * tuples until we've removed their index entries, and we want to process
- * index entry removal in batches as large as possible.
- */
static void
-lazy_vacuum_heap_rel(LVRelState *vacrel)
+do_lazy_vacuum_heap_rel(LVRelState *vacrel, TidStoreIter *iter)
{
- BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
- LVSavedErrInfo saved_err_info;
- TidStoreIter *iter;
- TidStoreIterResult *iter_result;
- Assert(vacrel->do_index_vacuuming);
- Assert(vacrel->do_index_cleanup);
- Assert(vacrel->num_index_scans > 0);
-
- /* Report that we are now vacuuming the heap */
- pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
- PROGRESS_VACUUM_PHASE_VACUUM_HEAP);
-
- /* Update error traceback information */
- update_vacuum_error_info(vacrel, &saved_err_info,
- VACUUM_ERRCB_PHASE_VACUUM_HEAP,
- InvalidBlockNumber, InvalidOffsetNumber);
+ TidStoreIterResult *iter_result;
- iter = TidStoreBeginIterate(vacrel->dead_items);
while ((iter_result = TidStoreIterateNext(iter)) != NULL)
{
BlockNumber blkno;
@@ -2439,26 +2414,100 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
UnlockReleaseBuffer(buf);
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
- vacuumed_pages++;
+ vacrel->scan_stats->vacuumed_pages++;
}
- TidStoreEndIterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
ReleaseBuffer(vmbuffer);
+}
+
+/*
+ * lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy
+ *
+ * This routine marks LP_DEAD items in vacrel->dead_items as LP_UNUSED. Pages
+ * that never had lazy_scan_prune record LP_DEAD items are not visited at all.
+ *
+ * We may also be able to truncate the line pointer array of the heap pages we
+ * visit. If there is a contiguous group of LP_UNUSED items at the end of the
+ * array, it can be reclaimed as free space. These LP_UNUSED items usually
+ * start out as LP_DEAD items recorded by lazy_scan_prune (we set items from
+ * each page to LP_UNUSED, and then consider if it's possible to truncate the
+ * page's line pointer array).
+ *
+ * Note: the reason for doing this as a second pass is we cannot remove the
+ * tuples until we've removed their index entries, and we want to process
+ * index entry removal in batches as large as possible.
+ */
+static void
+lazy_vacuum_heap_rel(LVRelState *vacrel)
+{
+ LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+
+ Assert(vacrel->do_index_vacuuming);
+ Assert(vacrel->do_index_cleanup);
+ Assert(vacrel->num_index_scans > 0);
+
+ /* Report that we are now vacuuming the heap */
+ pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+ PROGRESS_VACUUM_PHASE_VACUUM_HEAP);
+
+ /* Update error traceback information */
+ update_vacuum_error_info(vacrel, &saved_err_info,
+ VACUUM_ERRCB_PHASE_VACUUM_HEAP,
+ InvalidBlockNumber, InvalidOffsetNumber);
+
+ vacrel->scan_stats->vacuumed_pages = 0;
+
+ if (ParallelHeapVacuumIsActive(vacrel))
+ {
+ PHVState *phvstate = vacrel->phvstate;
+
+ iter = TidStoreBeginIterateShared(vacrel->dead_items);
+
+ phvstate->shared->do_heap_vacuum = true;
+ phvstate->shared->shared_iter_handle = TidStoreGetSharedIterHandle(iter);
+
+ /* launch workers */
+ vacrel->phvstate->nworkers_launched = parallel_vacuum_table_scan_begin(vacrel->pvs);
+ }
+ else
+ iter = TidStoreBeginIterate(vacrel->dead_items);
+
+ /* do the real work */
+ do_lazy_vacuum_heap_rel(vacrel, iter);
+
+ if (ParallelHeapVacuumIsActive(vacrel))
+ {
+ PHVState *phvstate = vacrel->phvstate;
+
+ parallel_vacuum_table_scan_end(vacrel->pvs);
+
+ /* Gather the heap vacuum statistics that workers collected */
+ for (int i = 0; i < phvstate->nworkers_launched; i++)
+ {
+ LVRelScanStats *ss = &(phvstate->shared->worker_scan_stats[i]);
+
+ vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages;
+ }
+ }
+
+ TidStoreEndIterate(iter);
+
/*
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
Assert(vacrel->num_index_scans > 1 ||
(vacrel->dead_items_info->num_items == vacrel->scan_stats->lpdead_items &&
- vacuumed_pages == vacrel->scan_stats->lpdead_item_pages));
+ vacrel->scan_stats->vacuumed_pages == vacrel->scan_stats->lpdead_item_pages));
ereport(DEBUG2,
(errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
vacrel->relname, (long long) vacrel->dead_items_info->num_items,
- vacuumed_pages)));
+ vacrel->scan_stats->vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
@@ -3514,6 +3563,7 @@ heap_parallel_vacuum_initialize(Relation rel, ParallelContext *pcxt,
shared->NewRelfrozenXid = vacrel->scan_stats->NewRelfrozenXid;
shared->NewRelminMxid = vacrel->scan_stats->NewRelminMxid;
shared->skippedallvis = vacrel->scan_stats->skippedallvis;
+ shared->do_index_vacuuming = vacrel->do_index_vacuuming;
/*
* XXX: we copy the contents of vistest to the shared area, but in order
@@ -3566,7 +3616,6 @@ heap_parallel_vacuum_scan_worker(Relation rel, ParallelVacuumState *pvs,
PHVScanWorkerState *scanstate;
LVRelScanStats *scan_stats;
ErrorContextCallback errcallback;
- bool scan_done;
phvstate = palloc(sizeof(PHVState));
@@ -3603,10 +3652,11 @@ heap_parallel_vacuum_scan_worker(Relation rel, ParallelVacuumState *pvs,
/* initialize per-worker relation statistics */
MemSet(scan_stats, 0, sizeof(LVRelScanStats));
- /* Set fields necessary for heap scan */
+ /* Set fields necessary for heap scan and vacuum */
vacrel.scan_stats->NewRelfrozenXid = shared->NewRelfrozenXid;
vacrel.scan_stats->NewRelminMxid = shared->NewRelminMxid;
vacrel.scan_stats->skippedallvis = shared->skippedallvis;
+ vacrel.do_index_vacuuming = shared->do_index_vacuuming;
/* Initialize the per-worker scan state if not yet */
if (!phvstate->myscanstate->initialized)
@@ -3628,25 +3678,44 @@ heap_parallel_vacuum_scan_worker(Relation rel, ParallelVacuumState *pvs,
vacrel.relnamespace = get_database_name(RelationGetNamespace(rel));
vacrel.relname = pstrdup(RelationGetRelationName(rel));
vacrel.indname = NULL;
- vacrel.phase = VACUUM_ERRCB_PHASE_SCAN_HEAP;
errcallback.callback = vacuum_error_callback;
errcallback.arg = &vacrel;
errcallback.previous = error_context_stack;
error_context_stack = &errcallback;
- scan_done = do_lazy_scan_heap(&vacrel);
+ if (shared->do_heap_vacuum)
+ {
+ TidStoreIter *iter;
+
+ iter = TidStoreAttachIterateShared(vacrel.dead_items, shared->shared_iter_handle);
+
+ /* Join parallel heap vacuum */
+ vacrel.phase = VACUUM_ERRCB_PHASE_VACUUM_HEAP;
+ do_lazy_vacuum_heap_rel(&vacrel, iter);
+
+ TidStoreEndIterate(iter);
+ }
+ else
+ {
+ bool scan_done;
+
+ /* Join parallel heap scan */
+ vacrel.phase = VACUUM_ERRCB_PHASE_SCAN_HEAP;
+ scan_done = do_lazy_scan_heap(&vacrel);
+
+ /*
+ * If the leader or a worker finishes the heap scan because the dead_items
+ * TID store is close to the limit, it might still have some allocated
+ * blocks in its scan state. Since this scan state might not be used in
+ * the next heap scan, we remember that it might have some unconsumed
+ * blocks so that the leader can complete the scans after the heap scan
+ * phase finishes.
+ */
+ phvstate->myscanstate->maybe_have_blocks = !scan_done;
+ }
/* Pop the error context stack */
error_context_stack = errcallback.previous;
-
- /*
- * If the leader or a worker finishes the heap scan because dead_items
- * TIDs is close to the limit, it might have some allocated blocks in its
- * scan state. Since this scan state might not be used in the next heap
- * scan, we remember that it might have some unconsumed blocks so that the
- * leader complete the scans after the heap scan phase finishes.
- */
- phvstate->myscanstate->maybe_have_blocks = !scan_done;
}
/*
@@ -3736,7 +3805,6 @@ parallel_heap_vacuum_gather_scan_stats(LVRelState *vacrel)
vacrel->scan_stats->frozen_pages += ss->frozen_pages;
vacrel->scan_stats->lpdead_item_pages += ss->lpdead_item_pages;
vacrel->scan_stats->missed_dead_pages += ss->missed_dead_pages;
- vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages;
vacrel->scan_stats->tuples_deleted += ss->tuples_deleted;
vacrel->scan_stats->tuples_frozen += ss->tuples_frozen;
vacrel->scan_stats->lpdead_items += ss->lpdead_items;
@@ -3774,6 +3842,7 @@ do_parallel_lazy_scan_heap(LVRelState *vacrel)
Assert(!IsParallelWorker());
/* launch workers */
+ vacrel->phvstate->shared->do_heap_vacuum = false;
vacrel->phvstate->nworkers_launched = parallel_vacuum_table_scan_begin(vacrel->pvs);
/* initialize parallel scan description to join as a worker */
--
2.43.5
v4-0001-Support-parallel-heap-scan-during-lazy-vacuum.patch
From 00a4337e8bd74a4764d9b4ed854c6684e92cb4f6 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 1 Jul 2024 15:17:46 +0900
Subject: [PATCH v4 1/4] Support parallel heap scan during lazy vacuum.
Commit 40d964ec99 allowed the VACUUM command to process indexes in
parallel. This change extends parallel vacuum to support parallel
heap scan during lazy vacuum.
---
src/backend/access/heap/heapam_handler.c | 6 +
src/backend/access/heap/vacuumlazy.c | 1140 ++++++++++++++++++----
src/backend/commands/vacuumparallel.c | 311 +++++-
src/backend/storage/ipc/procarray.c | 9 -
src/include/access/heapam.h | 8 +
src/include/access/tableam.h | 87 ++
src/include/commands/vacuum.h | 8 +-
src/include/utils/snapmgr.h | 14 +-
8 files changed, 1318 insertions(+), 265 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a8d95e0f1c1..c49eed81e24 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2659,6 +2659,12 @@ static const TableAmRoutine heapam_methods = {
.relation_copy_data = heapam_relation_copy_data,
.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
.relation_vacuum = heap_vacuum_rel,
+
+ .parallel_vacuum_compute_workers = heap_parallel_vacuum_compute_workers,
+ .parallel_vacuum_estimate = heap_parallel_vacuum_estimate,
+ .parallel_vacuum_initialize = heap_parallel_vacuum_initialize,
+ .parallel_vacuum_scan_worker = heap_parallel_vacuum_scan_worker,
+
.scan_analyze_next_block = heapam_scan_analyze_next_block,
.scan_analyze_next_tuple = heapam_scan_analyze_next_tuple,
.index_build_range_scan = heapam_index_build_range_scan,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 793bd33cb4d..10991666e0b 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -48,6 +48,7 @@
#include "common/int.h"
#include "executor/instrument.h"
#include "miscadmin.h"
+#include "optimizer/paths.h"
#include "pgstat.h"
#include "portability/instr_time.h"
#include "postmaster/autovacuum.h"
@@ -115,10 +116,24 @@
#define PREFETCH_SIZE ((BlockNumber) 32)
/*
- * Macro to check if we are in a parallel vacuum. If true, we are in the
- * parallel mode and the DSM segment is initialized.
+ * DSM keys for heap parallel vacuum scan. Unlike other parallel execution code,
+ * we don't need to worry about DSM keys conflicting with plan_node_id, but we do
+ * need to avoid conflicting with the DSM keys used in vacuumparallel.c.
+ */
+#define LV_PARALLEL_SCAN_SHARED 0xFFFF0001
+#define LV_PARALLEL_SCAN_DESC 0xFFFF0002
+#define LV_PARALLEL_SCAN_DESC_WORKER 0xFFFF0003
+
+/*
+ * Macros to check if we are in parallel heap vacuuming, parallel index vacuuming,
+ * or both. If ParallelVacuumIsActive() is true, we are in parallel mode, meaning
+ * that the dead item TIDs are stored in a shared memory area.
*/
#define ParallelVacuumIsActive(vacrel) ((vacrel)->pvs != NULL)
+#define ParallelIndexVacuumIsActive(vacrel) \
+ (ParallelVacuumIsActive(vacrel) && parallel_vacuum_get_nworkers_index((vacrel)->pvs) > 0)
+#define ParallelHeapVacuumIsActive(vacrel) \
+ (ParallelVacuumIsActive(vacrel) && parallel_vacuum_get_nworkers_table((vacrel)->pvs) > 0)
/* Phases of vacuum during which we report error context. */
typedef enum
@@ -131,6 +146,109 @@ typedef enum
VACUUM_ERRCB_PHASE_TRUNCATE,
} VacErrPhase;
+/*
+ * Relation statistics collected during heap scanning that need to be shared
+ * among parallel vacuum workers.
+ */
+typedef struct LVRelScanStats
+{
+ BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
+ BlockNumber removed_pages; /* # pages removed by relation truncation */
+ BlockNumber frozen_pages; /* # pages with newly frozen tuples */
+ BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
+ BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
+ BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
+
+ /* Counters that follow are only for scanned_pages */
+ int64 tuples_deleted; /* # deleted from table */
+ int64 tuples_frozen; /* # newly frozen */
+ int64 lpdead_items; /* # deleted from indexes */
+ int64 live_tuples; /* # live tuples remaining */
+ int64 recently_dead_tuples; /* # dead, but not yet removable */
+ int64 missed_dead_tuples; /* # removable, but not removed */
+
+ /* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid. */
+ TransactionId NewRelfrozenXid;
+ MultiXactId NewRelminMxid;
+ bool skippedallvis;
+} LVRelScanStats;
+
+/*
+ * Struct for information that needs to be shared among parallel vacuum workers
+ */
+typedef struct PHVShared
+{
+ bool aggressive;
+ bool skipwithvm;
+
+ /* The current oldest extant XID/MXID shared by the leader process */
+ TransactionId NewRelfrozenXid;
+ MultiXactId NewRelminMxid;
+
+ bool skippedallvis;
+
+ /* VACUUM operation's cutoffs for freezing and pruning */
+ struct VacuumCutoffs cutoffs;
+ GlobalVisState vistest;
+
+ /* per-worker scan stats for parallel heap vacuum scan */
+ LVRelScanStats worker_scan_stats[FLEXIBLE_ARRAY_MEMBER];
+} PHVShared;
+#define SizeOfPHVShared (offsetof(PHVShared, worker_scan_stats))
+
+/* Per-worker scan state for parallel heap vacuum scan */
+typedef struct PHVScanWorkerState
+{
+ bool initialized;
+
+ /* per-worker parallel table scan state */
+ ParallelBlockTableScanWorkerData state;
+
+ /*
+ * True if a parallel vacuum scan worker allocated blocks in state but
+ * might not have scanned all of them. The leader process will take over
+ * scanning these remaining blocks.
+ */
+ bool maybe_have_blocks;
+
+ /* current block number being processed */
+ pg_atomic_uint32 cur_blkno;
+} PHVScanWorkerState;
+
+/* Struct for parallel heap vacuum */
+typedef struct PHVState
+{
+ /* Parallel scan description shared among parallel workers */
+ ParallelBlockTableScanDesc pscandesc;
+
+ /* Shared information */
+ PHVShared *shared;
+
+ /*
+ * Points to the array of per-worker scan states stored in the DSM area.
+ *
+ * During parallel heap scan, each worker allocates some chunks of blocks
+ * to scan in its scan state, and could exit while leaving some chunks
+ * unscanned if the size of the dead_items TIDs is close to overrunning
+ * the available space. We store scan states in the shared memory area so
+ * that workers can resume heap scans from the previous point.
+ */
+ PHVScanWorkerState *scanstates;
+
+ /* Assigned per-worker scan state */
+ PHVScanWorkerState *myscanstate;
+
+ /*
+ * All blocks up to this value have been scanned, i.e. the minimum of
+ * cur_blkno among all PHVScanWorkerStates. It's updated by
+ * parallel_heap_vacuum_compute_min_blkno().
+ */
+ BlockNumber min_blkno;
+
+ /* The number of workers launched for parallel heap vacuum */
+ int nworkers_launched;
+} PHVState;
+
typedef struct LVRelState
{
/* Target heap relation and its indexes */
@@ -142,6 +260,9 @@ typedef struct LVRelState
BufferAccessStrategy bstrategy;
ParallelVacuumState *pvs;
+ /* Parallel heap vacuum state */
+ PHVState *phvstate;
+
/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
bool aggressive;
/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
@@ -157,10 +278,6 @@ typedef struct LVRelState
/* VACUUM operation's cutoffs for freezing and pruning */
struct VacuumCutoffs cutoffs;
GlobalVisState *vistest;
- /* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
- TransactionId NewRelfrozenXid;
- MultiXactId NewRelminMxid;
- bool skippedallvis;
/* Error reporting state */
char *dbname;
@@ -186,12 +303,10 @@ typedef struct LVRelState
VacDeadItemsInfo *dead_items_info;
BlockNumber rel_pages; /* total number of pages */
- BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
- BlockNumber removed_pages; /* # pages removed by relation truncation */
- BlockNumber frozen_pages; /* # pages with newly frozen tuples */
- BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
- BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
- BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
+ BlockNumber next_fsm_block_to_vacuum;
+
+ /* Statistics collected during heap scan */
+ LVRelScanStats *scan_stats;
/* Statistics output by us, for table */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -201,13 +316,6 @@ typedef struct LVRelState
/* Instrumentation counters */
int num_index_scans;
- /* Counters that follow are only for scanned_pages */
- int64 tuples_deleted; /* # deleted from table */
- int64 tuples_frozen; /* # newly frozen */
- int64 lpdead_items; /* # deleted from indexes */
- int64 live_tuples; /* # live tuples remaining */
- int64 recently_dead_tuples; /* # dead, but not yet removable */
- int64 missed_dead_tuples; /* # removable, but not removed */
/* State maintained by heap_vac_scan_next_block() */
BlockNumber current_block; /* last block returned */
@@ -227,6 +335,7 @@ typedef struct LVSavedErrInfo
/* non-export function prototypes */
static void lazy_scan_heap(LVRelState *vacrel);
+static bool do_lazy_scan_heap(LVRelState *vacrel);
static bool heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
bool *all_visible_according_to_vm);
static void find_next_unskippable_block(LVRelState *vacrel, bool *skipsallvis);
@@ -269,6 +378,12 @@ static void dead_items_cleanup(LVRelState *vacrel);
static bool heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
TransactionId *visibility_cutoff_xid, bool *all_frozen);
static void update_relstats_all_indexes(LVRelState *vacrel);
+
+static void do_parallel_lazy_scan_heap(LVRelState *vacrel);
+static void parallel_heap_vacuum_compute_min_blkno(LVRelState *vacrel);
+static void parallel_heap_vacuum_gather_scan_stats(LVRelState *vacrel);
+static void parallel_heap_complete_unfinised_scan(LVRelState *vacrel);
+
static void vacuum_error_callback(void *arg);
static void update_vacuum_error_info(LVRelState *vacrel,
LVSavedErrInfo *saved_vacrel,
@@ -294,6 +409,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
BufferAccessStrategy bstrategy)
{
LVRelState *vacrel;
+ LVRelScanStats *scan_stats;
bool verbose,
instrument,
skipwithvm,
@@ -404,14 +520,28 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
Assert(params->index_cleanup == VACOPTVALUE_AUTO);
}
+ vacrel->next_fsm_block_to_vacuum = 0;
+
/* Initialize page counters explicitly (be tidy) */
- vacrel->scanned_pages = 0;
- vacrel->removed_pages = 0;
- vacrel->frozen_pages = 0;
- vacrel->lpdead_item_pages = 0;
- vacrel->missed_dead_pages = 0;
- vacrel->nonempty_pages = 0;
- /* dead_items_alloc allocates vacrel->dead_items later on */
+ scan_stats = palloc(sizeof(LVRelScanStats));
+ scan_stats->scanned_pages = 0;
+ scan_stats->removed_pages = 0;
+ scan_stats->frozen_pages = 0;
+ scan_stats->lpdead_item_pages = 0;
+ scan_stats->missed_dead_pages = 0;
+ scan_stats->nonempty_pages = 0;
+
+ /* Initialize remaining counters (be tidy) */
+ scan_stats->tuples_deleted = 0;
+ scan_stats->tuples_frozen = 0;
+ scan_stats->lpdead_items = 0;
+ scan_stats->live_tuples = 0;
+ scan_stats->recently_dead_tuples = 0;
+ scan_stats->missed_dead_tuples = 0;
+
+ vacrel->scan_stats = scan_stats;
+
+ vacrel->num_index_scans = 0;
/* Allocate/initialize output statistics state */
vacrel->new_rel_tuples = 0;
@@ -419,14 +549,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->indstats = (IndexBulkDeleteResult **)
palloc0(vacrel->nindexes * sizeof(IndexBulkDeleteResult *));
- /* Initialize remaining counters (be tidy) */
- vacrel->num_index_scans = 0;
- vacrel->tuples_deleted = 0;
- vacrel->tuples_frozen = 0;
- vacrel->lpdead_items = 0;
- vacrel->live_tuples = 0;
- vacrel->recently_dead_tuples = 0;
- vacrel->missed_dead_tuples = 0;
+ /* dead_items_alloc allocates vacrel->dead_items later on */
/*
* Get cutoffs that determine which deleted tuples are considered DEAD,
@@ -448,9 +571,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
vacrel->vistest = GlobalVisTestFor(rel);
/* Initialize state used to track oldest extant XID/MXID */
- vacrel->NewRelfrozenXid = vacrel->cutoffs.OldestXmin;
- vacrel->NewRelminMxid = vacrel->cutoffs.OldestMxact;
- vacrel->skippedallvis = false;
+ vacrel->scan_stats->NewRelfrozenXid = vacrel->cutoffs.OldestXmin;
+ vacrel->scan_stats->NewRelminMxid = vacrel->cutoffs.OldestMxact;
+ vacrel->scan_stats->skippedallvis = false;
skipwithvm = true;
if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
{
@@ -531,15 +654,15 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* value >= FreezeLimit, and relminmxid to a value >= MultiXactCutoff.
* Non-aggressive VACUUMs may advance them by any amount, or not at all.
*/
- Assert(vacrel->NewRelfrozenXid == vacrel->cutoffs.OldestXmin ||
+ Assert(vacrel->scan_stats->NewRelfrozenXid == vacrel->cutoffs.OldestXmin ||
TransactionIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.FreezeLimit :
vacrel->cutoffs.relfrozenxid,
- vacrel->NewRelfrozenXid));
- Assert(vacrel->NewRelminMxid == vacrel->cutoffs.OldestMxact ||
+ vacrel->scan_stats->NewRelfrozenXid));
+ Assert(vacrel->scan_stats->NewRelminMxid == vacrel->cutoffs.OldestMxact ||
MultiXactIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.MultiXactCutoff :
vacrel->cutoffs.relminmxid,
- vacrel->NewRelminMxid));
- if (vacrel->skippedallvis)
+ vacrel->scan_stats->NewRelminMxid));
+ if (vacrel->scan_stats->skippedallvis)
{
/*
* Must keep original relfrozenxid in a non-aggressive VACUUM that
@@ -547,8 +670,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* values will have missed unfrozen XIDs from the pages we skipped.
*/
Assert(!vacrel->aggressive);
- vacrel->NewRelfrozenXid = InvalidTransactionId;
- vacrel->NewRelminMxid = InvalidMultiXactId;
+ vacrel->scan_stats->NewRelfrozenXid = InvalidTransactionId;
+ vacrel->scan_stats->NewRelminMxid = InvalidMultiXactId;
}
/*
@@ -569,7 +692,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
*/
vac_update_relstats(rel, new_rel_pages, vacrel->new_live_tuples,
new_rel_allvisible, vacrel->nindexes > 0,
- vacrel->NewRelfrozenXid, vacrel->NewRelminMxid,
+ vacrel->scan_stats->NewRelfrozenXid, vacrel->scan_stats->NewRelminMxid,
&frozenxid_updated, &minmulti_updated, false);
/*
@@ -585,8 +708,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(rel),
rel->rd_rel->relisshared,
Max(vacrel->new_live_tuples, 0),
- vacrel->recently_dead_tuples +
- vacrel->missed_dead_tuples);
+ vacrel->scan_stats->recently_dead_tuples +
+ vacrel->scan_stats->missed_dead_tuples);
pgstat_progress_end_command();
if (instrument)
@@ -659,21 +782,21 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->relname,
vacrel->num_index_scans);
appendStringInfo(&buf, _("pages: %u removed, %u remain, %u scanned (%.2f%% of total)\n"),
- vacrel->removed_pages,
+ vacrel->scan_stats->removed_pages,
new_rel_pages,
- vacrel->scanned_pages,
+ vacrel->scan_stats->scanned_pages,
orig_rel_pages == 0 ? 100.0 :
- 100.0 * vacrel->scanned_pages / orig_rel_pages);
+ 100.0 * vacrel->scan_stats->scanned_pages / orig_rel_pages);
appendStringInfo(&buf,
_("tuples: %lld removed, %lld remain, %lld are dead but not yet removable\n"),
- (long long) vacrel->tuples_deleted,
+ (long long) vacrel->scan_stats->tuples_deleted,
(long long) vacrel->new_rel_tuples,
- (long long) vacrel->recently_dead_tuples);
- if (vacrel->missed_dead_tuples > 0)
+ (long long) vacrel->scan_stats->recently_dead_tuples);
+ if (vacrel->scan_stats->missed_dead_tuples > 0)
appendStringInfo(&buf,
_("tuples missed: %lld dead from %u pages not removed due to cleanup lock contention\n"),
- (long long) vacrel->missed_dead_tuples,
- vacrel->missed_dead_pages);
+ (long long) vacrel->scan_stats->missed_dead_tuples,
+ vacrel->scan_stats->missed_dead_pages);
diff = (int32) (ReadNextTransactionId() -
vacrel->cutoffs.OldestXmin);
appendStringInfo(&buf,
@@ -681,25 +804,25 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->cutoffs.OldestXmin, diff);
if (frozenxid_updated)
{
- diff = (int32) (vacrel->NewRelfrozenXid -
+ diff = (int32) (vacrel->scan_stats->NewRelfrozenXid -
vacrel->cutoffs.relfrozenxid);
appendStringInfo(&buf,
_("new relfrozenxid: %u, which is %d XIDs ahead of previous value\n"),
- vacrel->NewRelfrozenXid, diff);
+ vacrel->scan_stats->NewRelfrozenXid, diff);
}
if (minmulti_updated)
{
- diff = (int32) (vacrel->NewRelminMxid -
+ diff = (int32) (vacrel->scan_stats->NewRelminMxid -
vacrel->cutoffs.relminmxid);
appendStringInfo(&buf,
_("new relminmxid: %u, which is %d MXIDs ahead of previous value\n"),
- vacrel->NewRelminMxid, diff);
+ vacrel->scan_stats->NewRelminMxid, diff);
}
appendStringInfo(&buf, _("frozen: %u pages from table (%.2f%% of total) had %lld tuples frozen\n"),
- vacrel->frozen_pages,
+ vacrel->scan_stats->frozen_pages,
orig_rel_pages == 0 ? 100.0 :
- 100.0 * vacrel->frozen_pages / orig_rel_pages,
- (long long) vacrel->tuples_frozen);
+ 100.0 * vacrel->scan_stats->frozen_pages / orig_rel_pages,
+ (long long) vacrel->scan_stats->tuples_frozen);
if (vacrel->do_index_vacuuming)
{
if (vacrel->nindexes == 0 || vacrel->num_index_scans == 0)
@@ -719,10 +842,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
msgfmt = _("%u pages from table (%.2f%% of total) have %lld dead item identifiers\n");
}
appendStringInfo(&buf, msgfmt,
- vacrel->lpdead_item_pages,
+ vacrel->scan_stats->lpdead_item_pages,
orig_rel_pages == 0 ? 100.0 :
- 100.0 * vacrel->lpdead_item_pages / orig_rel_pages,
- (long long) vacrel->lpdead_items);
+ 100.0 * vacrel->scan_stats->lpdead_item_pages / orig_rel_pages,
+ (long long) vacrel->scan_stats->lpdead_items);
for (int i = 0; i < vacrel->nindexes; i++)
{
IndexBulkDeleteResult *istat = vacrel->indstats[i];
@@ -823,14 +946,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
static void
lazy_scan_heap(LVRelState *vacrel)
{
- BlockNumber rel_pages = vacrel->rel_pages,
- blkno,
- next_fsm_block_to_vacuum = 0;
- bool all_visible_according_to_vm;
-
- TidStore *dead_items = vacrel->dead_items;
+ BlockNumber rel_pages = vacrel->rel_pages;
VacDeadItemsInfo *dead_items_info = vacrel->dead_items_info;
- Buffer vmbuffer = InvalidBuffer;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -850,6 +967,72 @@ lazy_scan_heap(LVRelState *vacrel)
vacrel->next_unskippable_allvis = false;
vacrel->next_unskippable_vmbuffer = InvalidBuffer;
+ /*
+ * Do the actual work. If parallel heap vacuum is active, we scan and
+ * vacuum heap with parallel workers.
+ */
+ if (ParallelHeapVacuumIsActive(vacrel))
+ do_parallel_lazy_scan_heap(vacrel);
+ else
+ do_lazy_scan_heap(vacrel);
+
+ /* report that everything is now scanned */
+ pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, rel_pages);
+
+ /* now we can compute the new value for pg_class.reltuples */
+ vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
+ vacrel->scan_stats->scanned_pages,
+ vacrel->scan_stats->live_tuples);
+
+ /*
+ * Also compute the total number of surviving heap entries. In the
+ * (unlikely) scenario that new_live_tuples is -1, take it as zero.
+ */
+ vacrel->new_rel_tuples =
+ Max(vacrel->new_live_tuples, 0) + vacrel->scan_stats->recently_dead_tuples +
+ vacrel->scan_stats->missed_dead_tuples;
+
+ /*
+ * Do index vacuuming (call each index's ambulkdelete routine), then do
+ * related heap vacuuming
+ */
+ if (dead_items_info->num_items > 0)
+ lazy_vacuum(vacrel);
+
+ /*
+ * Vacuum the remainder of the Free Space Map. We must do this whether or
+ * not there were indexes, and whether or not we bypassed index vacuuming.
+ */
+ if (rel_pages > vacrel->next_fsm_block_to_vacuum)
+ FreeSpaceMapVacuumRange(vacrel->rel, vacrel->next_fsm_block_to_vacuum,
+ rel_pages);
+
+ /* report all blocks vacuumed */
+ pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, rel_pages);
+
+ /* Do final index cleanup (call each index's amvacuumcleanup routine) */
+ if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
+ lazy_cleanup_all_indexes(vacrel);
+}
+
+/*
+ * Workhorse for lazy_scan_heap().
+ *
+ * Return true if we processed all blocks, or false if we exited from this
+ * function without completing the heap scan because the space for dead item
+ * TIDs was nearly full. In the serial heap scan case, this function always
+ * returns true. In a parallel heap vacuum scan, this function is called by
+ * both the worker processes and the leader process, and could return false.
+ */
+static bool
+do_lazy_scan_heap(LVRelState *vacrel)
+{
+ bool all_visible_according_to_vm;
+ TidStore *dead_items = vacrel->dead_items;
+ VacDeadItemsInfo *dead_items_info = vacrel->dead_items_info;
+ BlockNumber blkno;
+ Buffer vmbuffer = InvalidBuffer;
+ bool scan_done = true;
+
while (heap_vac_scan_next_block(vacrel, &blkno, &all_visible_according_to_vm))
{
Buffer buf;
@@ -857,13 +1040,20 @@ lazy_scan_heap(LVRelState *vacrel)
bool has_lpdead_items;
bool got_cleanup_lock = false;
- vacrel->scanned_pages++;
+ vacrel->scan_stats->scanned_pages++;
/* Report as block scanned, update error traceback information */
pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
blkno, InvalidOffsetNumber);
+ /*
+ * If parallel vacuum scan is enabled, advertise the current block
+ * number
+ */
+ if (ParallelHeapVacuumIsActive(vacrel))
+ pg_atomic_write_u32(&(vacrel->phvstate->myscanstate->cur_blkno), (uint32) blkno);
+
vacuum_delay_point();
/*
@@ -875,46 +1065,10 @@ lazy_scan_heap(LVRelState *vacrel)
* one-pass strategy, and the two-pass strategy with the index_cleanup
* param set to 'off'.
*/
- if (vacrel->scanned_pages % FAILSAFE_EVERY_PAGES == 0)
+ if (!IsParallelWorker() &&
+ vacrel->scan_stats->scanned_pages % FAILSAFE_EVERY_PAGES == 0)
lazy_check_wraparound_failsafe(vacrel);
- /*
- * Consider if we definitely have enough space to process TIDs on page
- * already. If we are close to overrunning the available space for
- * dead_items TIDs, pause and do a cycle of vacuuming before we tackle
- * this page.
- */
- if (TidStoreMemoryUsage(dead_items) > dead_items_info->max_bytes)
- {
- /*
- * Before beginning index vacuuming, we release any pin we may
- * hold on the visibility map page. This isn't necessary for
- * correctness, but we do it anyway to avoid holding the pin
- * across a lengthy, unrelated operation.
- */
- if (BufferIsValid(vmbuffer))
- {
- ReleaseBuffer(vmbuffer);
- vmbuffer = InvalidBuffer;
- }
-
- /* Perform a round of index and heap vacuuming */
- vacrel->consider_bypass_optimization = false;
- lazy_vacuum(vacrel);
-
- /*
- * Vacuum the Free Space Map to make newly-freed space visible on
- * upper-level FSM pages. Note we have not yet processed blkno.
- */
- FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
- blkno);
- next_fsm_block_to_vacuum = blkno;
-
- /* Report that we are once again scanning the heap */
- pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
- PROGRESS_VACUUM_PHASE_SCAN_HEAP);
- }
-
/*
* Pin the visibility map page in case we need to mark the page
* all-visible. In most cases this will be very cheap, because we'll
@@ -1003,9 +1157,10 @@ lazy_scan_heap(LVRelState *vacrel)
* revisit this page. Since updating the FSM is desirable but not
* absolutely required, that's OK.
*/
- if (vacrel->nindexes == 0
- || !vacrel->do_index_vacuuming
- || !has_lpdead_items)
+ if (!IsParallelWorker() &&
+ (vacrel->nindexes == 0
+ || !vacrel->do_index_vacuuming
+ || !has_lpdead_items))
{
Size freespace = PageGetHeapFreeSpace(page);
@@ -1019,57 +1174,172 @@ lazy_scan_heap(LVRelState *vacrel)
* held the cleanup lock and lazy_scan_prune() was called.
*/
if (got_cleanup_lock && vacrel->nindexes == 0 && has_lpdead_items &&
- blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
+ blkno - vacrel->next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
{
- FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
- blkno);
- next_fsm_block_to_vacuum = blkno;
+ BlockNumber fsm_vac_up_to;
+
+ /*
+ * If parallel heap vacuum scan is active, compute the minimum
+ * block number we scanned so far.
+ */
+ if (ParallelHeapVacuumIsActive(vacrel))
+ {
+ parallel_heap_vacuum_compute_min_blkno(vacrel);
+ fsm_vac_up_to = vacrel->phvstate->min_blkno;
+ }
+ else
+ {
+ /* blkno is already processed */
+ fsm_vac_up_to = blkno + 1;
+ }
+
+ FreeSpaceMapVacuumRange(vacrel->rel, vacrel->next_fsm_block_to_vacuum,
+ fsm_vac_up_to);
+ vacrel->next_fsm_block_to_vacuum = fsm_vac_up_to;
}
}
else
UnlockReleaseBuffer(buf);
+
+ /*
+ * Consider if we definitely have enough space to process TIDs on page
+ * already. If we are close to overrunning the available space for
+ * dead_items TIDs, pause and do a cycle of vacuuming before we tackle
+ * this page.
+ */
+ if (TidStoreMemoryUsage(dead_items) > dead_items_info->max_bytes)
+ {
+ /*
+ * Before beginning index vacuuming, we release any pin we may
+ * hold on the visibility map page. This isn't necessary for
+ * correctness, but we do it anyway to avoid holding the pin
+ * across a lengthy, unrelated operation.
+ */
+ if (BufferIsValid(vmbuffer))
+ {
+ ReleaseBuffer(vmbuffer);
+ vmbuffer = InvalidBuffer;
+ }
+
+ if (ParallelHeapVacuumIsActive(vacrel))
+ {
+ /* Remember we might have some unprocessed blocks */
+ scan_done = false;
+
+ /*
+ * Pause the heap scan without invoking index and heap
+ * vacuuming. The leader process also skips FSM vacuum since
+ * some blocks before blkno might not have been processed yet. The
+ * leader will wait for all workers to finish and perform
+ * index and heap vacuuming, and then perform FSM vacuum.
+ */
+ break;
+ }
+
+ /* Perform a round of index and heap vacuuming */
+ vacrel->consider_bypass_optimization = false;
+ lazy_vacuum(vacrel);
+
+ /*
+ * Vacuum the Free Space Map to make newly-freed space visible on
+ * upper-level FSM pages.
+ */
+ FreeSpaceMapVacuumRange(vacrel->rel, vacrel->next_fsm_block_to_vacuum,
+ blkno + 1);
+ vacrel->next_fsm_block_to_vacuum = blkno;
+
+ /* Report that we are once again scanning the heap */
+ pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+ PROGRESS_VACUUM_PHASE_SCAN_HEAP);
+
+ continue;
+ }
}
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
ReleaseBuffer(vmbuffer);
- /* report that everything is now scanned */
- pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
+ return scan_done;
+}
- /* now we can compute the new value for pg_class.reltuples */
- vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
- vacrel->scanned_pages,
- vacrel->live_tuples);
+/*
+ * A parallel scan variant of heap_vac_scan_next_block.
+ *
+ * In parallel vacuum scan, we don't use the SKIP_PAGES_THRESHOLD optimization.
+ */
+static bool
+heap_vac_scan_next_block_parallel(LVRelState *vacrel, BlockNumber *blkno,
+ bool *all_visible_according_to_vm)
+{
+ PHVState *phvstate = vacrel->phvstate;
+ BlockNumber next_block;
+ Buffer vmbuffer = InvalidBuffer;
+ uint8 mapbits = 0;
- /*
- * Also compute the total number of surviving heap entries. In the
- * (unlikely) scenario that new_live_tuples is -1, take it as zero.
- */
- vacrel->new_rel_tuples =
- Max(vacrel->new_live_tuples, 0) + vacrel->recently_dead_tuples +
- vacrel->missed_dead_tuples;
+ Assert(ParallelHeapVacuumIsActive(vacrel));
- /*
- * Do index vacuuming (call each index's ambulkdelete routine), then do
- * related heap vacuuming
- */
- if (dead_items_info->num_items > 0)
- lazy_vacuum(vacrel);
+ for (;;)
+ {
+ next_block = table_block_parallelscan_nextpage(vacrel->rel,
+ &(phvstate->myscanstate->state),
+ phvstate->pscandesc);
- /*
- * Vacuum the remainder of the Free Space Map. We must do this whether or
- * not there were indexes, and whether or not we bypassed index vacuuming.
- */
- if (blkno > next_fsm_block_to_vacuum)
- FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno);
+ /* Have we reached the end of the table? */
+ if (!BlockNumberIsValid(next_block) || next_block >= vacrel->rel_pages)
+ {
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
- /* report all blocks vacuumed */
- pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
+ *blkno = vacrel->rel_pages;
+ return false;
+ }
- /* Do final index cleanup (call each index's amvacuumcleanup routine) */
- if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
- lazy_cleanup_all_indexes(vacrel);
+ /* We always treat the last block as unsafe to skip */
+ if (next_block == vacrel->rel_pages - 1)
+ break;
+
+ mapbits = visibilitymap_get_status(vacrel->rel, next_block, &vmbuffer);
+
+ /*
+ * A block is unskippable if it is not all visible according to the
+ * visibility map.
+ */
+ if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
+ {
+ Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
+ break;
+ }
+
+ /* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
+ if (!vacrel->skipwithvm)
+ break;
+
+ /*
+ * Aggressive VACUUM caller can't skip pages just because they are
+ * all-visible.
+ */
+ if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
+ {
+ if (vacrel->aggressive)
+ break;
+
+ /*
+ * All-visible block is safe to skip in non-aggressive case. But
+ * remember that the final range contains such a block for later.
+ */
+ vacrel->scan_stats->skippedallvis = true;
+ }
+ }
+
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
+
+ *blkno = next_block;
+ *all_visible_according_to_vm = (mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0;
+
+ return true;
}
/*
@@ -1096,6 +1366,9 @@ heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
{
BlockNumber next_block;
+ if (ParallelHeapVacuumIsActive(vacrel))
+ return heap_vac_scan_next_block_parallel(vacrel, blkno, all_visible_according_to_vm);
+
/* relies on InvalidBlockNumber + 1 overflowing to 0 on first call */
next_block = vacrel->current_block + 1;
@@ -1145,7 +1418,7 @@ heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
{
next_block = vacrel->next_unskippable_block;
if (skipsallvis)
- vacrel->skippedallvis = true;
+ vacrel->scan_stats->skippedallvis = true;
}
}
@@ -1218,11 +1491,12 @@ find_next_unskippable_block(LVRelState *vacrel, bool *skipsallvis)
/*
* Caller must scan the last page to determine whether it has tuples
- * (caller must have the opportunity to set vacrel->nonempty_pages).
- * This rule avoids having lazy_truncate_heap() take access-exclusive
- * lock on rel to attempt a truncation that fails anyway, just because
- * there are tuples on the last page (it is likely that there will be
- * tuples on other nearby pages as well, but those can be skipped).
+ * (caller must have the opportunity to set
+ * vacrel->scan_stats->nonempty_pages). This rule avoids having
+ * lazy_truncate_heap() take access-exclusive lock on rel to attempt a
+ * truncation that fails anyway, just because there are tuples on the
+ * last page (it is likely that there will be tuples on other nearby
+ * pages as well, but those can be skipped).
*
* Implement this by always treating the last block as unsafe to skip.
*/
@@ -1447,10 +1721,10 @@ lazy_scan_prune(LVRelState *vacrel,
heap_page_prune_and_freeze(rel, buf, vacrel->vistest, prune_options,
&vacrel->cutoffs, &presult, PRUNE_VACUUM_SCAN,
&vacrel->offnum,
- &vacrel->NewRelfrozenXid, &vacrel->NewRelminMxid);
+ &vacrel->scan_stats->NewRelfrozenXid, &vacrel->scan_stats->NewRelminMxid);
- Assert(MultiXactIdIsValid(vacrel->NewRelminMxid));
- Assert(TransactionIdIsValid(vacrel->NewRelfrozenXid));
+ Assert(MultiXactIdIsValid(vacrel->scan_stats->NewRelminMxid));
+ Assert(TransactionIdIsValid(vacrel->scan_stats->NewRelfrozenXid));
if (presult.nfrozen > 0)
{
@@ -1459,7 +1733,7 @@ lazy_scan_prune(LVRelState *vacrel,
* nfrozen == 0, since it only counts pages with newly frozen tuples
* (don't confuse that with pages newly set all-frozen in VM).
*/
- vacrel->frozen_pages++;
+ vacrel->scan_stats->frozen_pages++;
}
/*
@@ -1494,7 +1768,7 @@ lazy_scan_prune(LVRelState *vacrel,
*/
if (presult.lpdead_items > 0)
{
- vacrel->lpdead_item_pages++;
+ vacrel->scan_stats->lpdead_item_pages++;
/*
* deadoffsets are collected incrementally in
@@ -1509,15 +1783,15 @@ lazy_scan_prune(LVRelState *vacrel,
}
/* Finally, add page-local counts to whole-VACUUM counts */
- vacrel->tuples_deleted += presult.ndeleted;
- vacrel->tuples_frozen += presult.nfrozen;
- vacrel->lpdead_items += presult.lpdead_items;
- vacrel->live_tuples += presult.live_tuples;
- vacrel->recently_dead_tuples += presult.recently_dead_tuples;
+ vacrel->scan_stats->tuples_deleted += presult.ndeleted;
+ vacrel->scan_stats->tuples_frozen += presult.nfrozen;
+ vacrel->scan_stats->lpdead_items += presult.lpdead_items;
+ vacrel->scan_stats->live_tuples += presult.live_tuples;
+ vacrel->scan_stats->recently_dead_tuples += presult.recently_dead_tuples;
/* Can't truncate this page */
if (presult.hastup)
- vacrel->nonempty_pages = blkno + 1;
+ vacrel->scan_stats->nonempty_pages = blkno + 1;
/* Did we find LP_DEAD items? */
*has_lpdead_items = (presult.lpdead_items > 0);
@@ -1667,8 +1941,8 @@ lazy_scan_noprune(LVRelState *vacrel,
missed_dead_tuples;
bool hastup;
HeapTupleHeader tupleheader;
- TransactionId NoFreezePageRelfrozenXid = vacrel->NewRelfrozenXid;
- MultiXactId NoFreezePageRelminMxid = vacrel->NewRelminMxid;
+ TransactionId NoFreezePageRelfrozenXid = vacrel->scan_stats->NewRelfrozenXid;
+ MultiXactId NoFreezePageRelminMxid = vacrel->scan_stats->NewRelminMxid;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1795,8 +2069,8 @@ lazy_scan_noprune(LVRelState *vacrel,
* this particular page until the next VACUUM. Remember its details now.
* (lazy_scan_prune expects a clean slate, so we have to do this last.)
*/
- vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
- vacrel->NewRelminMxid = NoFreezePageRelminMxid;
+ vacrel->scan_stats->NewRelfrozenXid = NoFreezePageRelfrozenXid;
+ vacrel->scan_stats->NewRelminMxid = NoFreezePageRelminMxid;
/* Save any LP_DEAD items found on the page in dead_items */
if (vacrel->nindexes == 0)
@@ -1823,25 +2097,25 @@ lazy_scan_noprune(LVRelState *vacrel,
* indexes will be deleted during index vacuuming (and then marked
* LP_UNUSED in the heap)
*/
- vacrel->lpdead_item_pages++;
+ vacrel->scan_stats->lpdead_item_pages++;
dead_items_add(vacrel, blkno, deadoffsets, lpdead_items);
- vacrel->lpdead_items += lpdead_items;
+ vacrel->scan_stats->lpdead_items += lpdead_items;
}
/*
* Finally, add relevant page-local counts to whole-VACUUM counts
*/
- vacrel->live_tuples += live_tuples;
- vacrel->recently_dead_tuples += recently_dead_tuples;
- vacrel->missed_dead_tuples += missed_dead_tuples;
+ vacrel->scan_stats->live_tuples += live_tuples;
+ vacrel->scan_stats->recently_dead_tuples += recently_dead_tuples;
+ vacrel->scan_stats->missed_dead_tuples += missed_dead_tuples;
if (missed_dead_tuples > 0)
- vacrel->missed_dead_pages++;
+ vacrel->scan_stats->missed_dead_pages++;
/* Can't truncate this page */
if (hastup)
- vacrel->nonempty_pages = blkno + 1;
+ vacrel->scan_stats->nonempty_pages = blkno + 1;
/* Did we find LP_DEAD items? */
*has_lpdead_items = (lpdead_items > 0);
@@ -1870,7 +2144,7 @@ lazy_vacuum(LVRelState *vacrel)
/* Should not end up here with no indexes */
Assert(vacrel->nindexes > 0);
- Assert(vacrel->lpdead_item_pages > 0);
+ Assert(vacrel->scan_stats->lpdead_item_pages > 0);
if (!vacrel->do_index_vacuuming)
{
@@ -1904,7 +2178,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items_info->num_items);
+ Assert(vacrel->scan_stats->lpdead_items == vacrel->dead_items_info->num_items);
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -1931,7 +2205,7 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
+ bypass = (vacrel->scan_stats->lpdead_item_pages < threshold &&
(TidStoreMemoryUsage(vacrel->dead_items) < (32L * 1024L * 1024L)));
}
@@ -2024,7 +2298,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
progress_start_val[1] = vacrel->nindexes;
pgstat_progress_update_multi_param(2, progress_start_index, progress_start_val);
- if (!ParallelVacuumIsActive(vacrel))
+ if (!ParallelIndexVacuumIsActive(vacrel))
{
for (int idx = 0; idx < vacrel->nindexes; idx++)
{
@@ -2069,7 +2343,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items_info->num_items == vacrel->lpdead_items);
+ vacrel->dead_items_info->num_items == vacrel->scan_stats->lpdead_items);
Assert(allindexes || VacuumFailsafeActive);
/*
@@ -2178,8 +2452,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* the second heap pass. No more, no less.
*/
Assert(vacrel->num_index_scans > 1 ||
- (vacrel->dead_items_info->num_items == vacrel->lpdead_items &&
- vacuumed_pages == vacrel->lpdead_item_pages));
+ (vacrel->dead_items_info->num_items == vacrel->scan_stats->lpdead_items &&
+ vacuumed_pages == vacrel->scan_stats->lpdead_item_pages));
ereport(DEBUG2,
(errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
@@ -2332,7 +2606,7 @@ lazy_check_wraparound_failsafe(LVRelState *vacrel)
vacrel->do_index_cleanup = false;
vacrel->do_rel_truncate = false;
/* Reset the progress counters */
pgstat_progress_update_multi_param(2, progress_index, progress_val);
ereport(WARNING,
@@ -2360,7 +2634,7 @@ static void
lazy_cleanup_all_indexes(LVRelState *vacrel)
{
double reltuples = vacrel->new_rel_tuples;
- bool estimated_count = vacrel->scanned_pages < vacrel->rel_pages;
+ bool estimated_count = vacrel->scan_stats->scanned_pages < vacrel->rel_pages;
const int progress_start_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_INDEXES_TOTAL
@@ -2383,7 +2657,7 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
progress_start_val[1] = vacrel->nindexes;
pgstat_progress_update_multi_param(2, progress_start_index, progress_start_val);
- if (!ParallelVacuumIsActive(vacrel))
+ if (!ParallelIndexVacuumIsActive(vacrel))
{
for (int idx = 0; idx < vacrel->nindexes; idx++)
{
@@ -2407,7 +2681,7 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
estimated_count);
}
/* Reset the progress counters */
pgstat_progress_update_multi_param(2, progress_end_index, progress_end_val);
}
@@ -2541,7 +2815,7 @@ should_attempt_truncation(LVRelState *vacrel)
if (!vacrel->do_rel_truncate || VacuumFailsafeActive)
return false;
- possibly_freeable = vacrel->rel_pages - vacrel->nonempty_pages;
+ possibly_freeable = vacrel->rel_pages - vacrel->scan_stats->nonempty_pages;
if (possibly_freeable > 0 &&
(possibly_freeable >= REL_TRUNCATE_MINIMUM ||
possibly_freeable >= vacrel->rel_pages / REL_TRUNCATE_FRACTION))
@@ -2567,7 +2841,7 @@ lazy_truncate_heap(LVRelState *vacrel)
/* Update error traceback information one last time */
update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_TRUNCATE,
- vacrel->nonempty_pages, InvalidOffsetNumber);
+ vacrel->scan_stats->nonempty_pages, InvalidOffsetNumber);
/*
* Loop until no more truncating can be done.
@@ -2668,7 +2942,7 @@ lazy_truncate_heap(LVRelState *vacrel)
* without also touching reltuples, since the tuple count wasn't
* changed by the truncation.
*/
- vacrel->removed_pages += orig_rel_pages - new_rel_pages;
+ vacrel->scan_stats->removed_pages += orig_rel_pages - new_rel_pages;
vacrel->rel_pages = new_rel_pages;
ereport(vacrel->verbose ? INFO : DEBUG2,
@@ -2676,7 +2950,7 @@ lazy_truncate_heap(LVRelState *vacrel)
vacrel->relname,
orig_rel_pages, new_rel_pages)));
orig_rel_pages = new_rel_pages;
- } while (new_rel_pages > vacrel->nonempty_pages && lock_waiter_detected);
+ } while (new_rel_pages > vacrel->scan_stats->nonempty_pages && lock_waiter_detected);
}
/*
@@ -2704,7 +2978,7 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
StaticAssertStmt((PREFETCH_SIZE & (PREFETCH_SIZE - 1)) == 0,
"prefetch size must be power of 2");
prefetchedUntil = InvalidBlockNumber;
- while (blkno > vacrel->nonempty_pages)
+ while (blkno > vacrel->scan_stats->nonempty_pages)
{
Buffer buf;
Page page;
@@ -2816,7 +3090,7 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
* pages still are; we need not bother to look at the last known-nonempty
* page.
*/
- return vacrel->nonempty_pages;
+ return vacrel->scan_stats->nonempty_pages;
}
/*
@@ -2834,12 +3108,8 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
autovacuum_work_mem != -1 ?
autovacuum_work_mem : maintenance_work_mem;
- /*
- * Initialize state for a parallel vacuum. As of now, only one worker can
- * be used for an index, so we invoke parallelism only if there are at
- * least two indexes on a table.
- */
- if (nworkers >= 0 && vacrel->nindexes > 1 && vacrel->do_index_vacuuming)
+ /* Initialize state for a parallel vacuum */
+ if (nworkers >= 0)
{
/*
* Since parallel workers cannot access data in temporary tables, we
@@ -2857,11 +3127,20 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
vacrel->relname)));
}
else
+ {
+ /*
+ * We initialize parallel heap scan/vacuuming, index vacuuming, or
+ * both, based on the table size and the number of indexes. Note
+ * that since only one worker can be used per index, we invoke
+ * parallelism for index vacuuming only if there are at least two
+ * indexes on the table.
+ */
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
vac_work_mem,
vacrel->verbose ? INFO : DEBUG2,
- vacrel->bstrategy);
+ vacrel->bstrategy, (void *) vacrel);
+ }
/*
* If parallel mode started, dead_items and dead_items_info spaces are
@@ -2902,9 +3181,19 @@ dead_items_add(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
};
int64 prog_val[2];
+ /*
+ * Protect both dead_items and dead_items_info from concurrent updates in
+ * parallel heap scan cases.
+ */
+ if (ParallelHeapVacuumIsActive(vacrel))
+ TidStoreLockExclusive(dead_items);
+
TidStoreSetBlockOffsets(dead_items, blkno, offsets, num_offsets);
vacrel->dead_items_info->num_items += num_offsets;
+ if (ParallelHeapVacuumIsActive(vacrel))
+ TidStoreUnlock(dead_items);
+
/* update the progress information */
prog_val[0] = vacrel->dead_items_info->num_items;
prog_val[1] = TidStoreMemoryUsage(dead_items);
@@ -3106,6 +3395,457 @@ update_relstats_all_indexes(LVRelState *vacrel)
}
}
+/*
+ * Compute the number of parallel workers for parallel vacuum heap scan.
+ *
+ * The calculation logic is borrowed from compute_parallel_worker().
+ */
+int
+heap_parallel_vacuum_compute_workers(Relation rel, int nrequested)
+{
+ int parallel_workers = 0;
+ int heap_parallel_threshold;
+ int heap_pages;
+
+ if (nrequested == 0)
+ {
+ /*
+ * Select the number of workers based on the log of the size of the
+ * relation. This probably needs to be a good deal more
+ * sophisticated, but we need something here for now. Note that the
+ * upper limit of the min_parallel_table_scan_size GUC is chosen to
+ * prevent overflow here.
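+ *
+ * For example, with the default min_parallel_table_scan_size of 1024
+ * blocks (8MB), this selects one worker at 3072 blocks (24MB), two at
+ * 9216 blocks (72MB), three at 27648 blocks (216MB), and so on,
+ * tripling the threshold for each additional worker (before the cap
+ * the caller applies).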
+ */
+ heap_parallel_threshold = Max(min_parallel_table_scan_size, 1);
+ heap_pages = RelationGetNumberOfBlocks(rel);
+ while (heap_pages >= (BlockNumber) (heap_parallel_threshold * 3))
+ {
+ parallel_workers++;
+ heap_parallel_threshold *= 3;
+ if (heap_parallel_threshold > INT_MAX / 3)
+ break;
+ }
+ }
+ else
+ parallel_workers = nrequested;
+
+ return parallel_workers;
+}
+
+/* Estimate shared memory sizes required for parallel heap vacuum */
+static inline void
+heap_parallel_estimate_shared_memory_size(Relation rel, int nworkers, Size *pscan_len,
+ Size *shared_len, Size *pscanwork_len)
+{
+ Size size = 0;
+
+ size = add_size(size, SizeOfPHVShared);
+ size = add_size(size, mul_size(sizeof(LVRelScanStats), nworkers));
+ *shared_len = size;
+
+ *pscan_len = table_block_parallelscan_estimate(rel);
+
+ *pscanwork_len = mul_size(sizeof(PHVScanWorkerState), nworkers);
+}
+
+/*
+ * Compute the amount of space we'll need in the parallel heap vacuum
+ * DSM, and inform pcxt->estimator about our needs.
+ *
+ * nworkers is the number of workers for the table vacuum. Note that it could
+ * differ from pcxt->nworkers, since pcxt->nworkers is the maximum of the
+ * number of workers for table vacuum and for index vacuum.
+ */
+void
+heap_parallel_vacuum_estimate(Relation rel, ParallelContext *pcxt,
+ int nworkers, void *state)
+{
+ Size pscan_len;
+ Size shared_len;
+ Size pscanwork_len;
+
+ heap_parallel_estimate_shared_memory_size(rel, nworkers, &pscan_len,
+ &shared_len, &pscanwork_len);
+
+ /* space for PHVShared */
+ shm_toc_estimate_chunk(&pcxt->estimator, shared_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+ /* space for ParallelBlockTableScanDesc */
+ shm_toc_estimate_chunk(&pcxt->estimator, pscan_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+ /* space for per-worker scan state, PHVScanWorkerState */
+ shm_toc_estimate_chunk(&pcxt->estimator, pscanwork_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/*
+ * Set up shared memory for parallel heap vacuum.
+ */
+void
+heap_parallel_vacuum_initialize(Relation rel, ParallelContext *pcxt,
+ int nworkers, void *state)
+{
+ LVRelState *vacrel = (LVRelState *) state;
+ PHVState *phvstate;
+ ParallelBlockTableScanDesc pscan;
+ PHVScanWorkerState *pscanwork;
+ PHVShared *shared;
+ Size pscan_len;
+ Size shared_len;
+ Size pscanwork_len;
+
+ phvstate = (PHVState *) palloc(sizeof(PHVState));
+
+ heap_parallel_estimate_shared_memory_size(rel, nworkers, &pscan_len,
+ &shared_len, &pscanwork_len);
+
+ shared = shm_toc_allocate(pcxt->toc, shared_len);
+
+ /* Prepare the shared information */
+
+ MemSet(shared, 0, shared_len);
+ shared->aggressive = vacrel->aggressive;
+ shared->skipwithvm = vacrel->skipwithvm;
+ shared->cutoffs = vacrel->cutoffs;
+ shared->NewRelfrozenXid = vacrel->scan_stats->NewRelfrozenXid;
+ shared->NewRelminMxid = vacrel->scan_stats->NewRelminMxid;
+ shared->skippedallvis = vacrel->scan_stats->skippedallvis;
+
+ /*
+ * XXX: we copy the contents of vistest to the shared area, but in order
+ * to do that, we need to either expose GlobalVisTest or to provide
+ * functions that copy the contents of GlobalVisTest somewhere. Currently we
+ * do the former, but it's not clear that's the best choice.
+ *
+ * An alternative idea is to have each worker determine the cutoff and use
+ * its own vistest. But that needs careful consideration since parallel
+ * workers would end up with different cutoffs and horizons.
+ */
+ shared->vistest = *vacrel->vistest;
+
+ shm_toc_insert(pcxt->toc, LV_PARALLEL_SCAN_SHARED, shared);
+
+ phvstate->shared = shared;
+
+ /* prepare the parallel block table scan description */
+ pscan = shm_toc_allocate(pcxt->toc, pscan_len);
+ shm_toc_insert(pcxt->toc, LV_PARALLEL_SCAN_DESC, pscan);
+
+ /* initialize parallel scan description */
+ table_block_parallelscan_initialize(rel, (ParallelTableScanDesc) pscan);
+
+ /* Disable sync scan to always start from the first block */
+ pscan->base.phs_syncscan = false;
+
+ phvstate->pscandesc = pscan;
+
+ /* prepare the workers' parallel block table scan state */
+ pscanwork = shm_toc_allocate(pcxt->toc, pscanwork_len);
+ MemSet(pscanwork, 0, pscanwork_len);
+ shm_toc_insert(pcxt->toc, LV_PARALLEL_SCAN_DESC_WORKER, pscanwork);
+ phvstate->scanstates = pscanwork;
+
+ vacrel->phvstate = phvstate;
+}
+
+/*
+ * Main function for parallel heap vacuum workers.
+ */
+void
+heap_parallel_vacuum_scan_worker(Relation rel, ParallelVacuumState *pvs,
+ ParallelWorkerContext *pwcxt)
+{
+ LVRelState vacrel = {0};
+ PHVState *phvstate;
+ PHVShared *shared;
+ ParallelBlockTableScanDesc pscandesc;
+ PHVScanWorkerState *scanstate;
+ LVRelScanStats *scan_stats;
+ ErrorContextCallback errcallback;
+ bool scan_done;
+
+ phvstate = palloc(sizeof(PHVState));
+
+ pscandesc = (ParallelBlockTableScanDesc) shm_toc_lookup(pwcxt->toc,
+ LV_PARALLEL_SCAN_DESC,
+ false);
+ phvstate->pscandesc = pscandesc;
+
+ shared = (PHVShared *) shm_toc_lookup(pwcxt->toc, LV_PARALLEL_SCAN_SHARED,
+ false);
+ phvstate->shared = shared;
+
+ scanstate = (PHVScanWorkerState *) shm_toc_lookup(pwcxt->toc,
+ LV_PARALLEL_SCAN_DESC_WORKER,
+ false);
+
+ phvstate->myscanstate = &(scanstate[ParallelWorkerNumber]);
+ scan_stats = &(shared->worker_scan_stats[ParallelWorkerNumber]);
+
+ /* Prepare LVRelState */
+ vacrel.rel = rel;
+ vacrel.indrels = parallel_vacuum_get_table_indexes(pvs, &vacrel.nindexes);
+ vacrel.pvs = pvs;
+ vacrel.phvstate = phvstate;
+ vacrel.aggressive = shared->aggressive;
+ vacrel.skipwithvm = shared->skipwithvm;
+ vacrel.cutoffs = shared->cutoffs;
+ vacrel.vistest = &(shared->vistest);
+ vacrel.dead_items = parallel_vacuum_get_dead_items(pvs,
+ &vacrel.dead_items_info);
+ vacrel.rel_pages = RelationGetNumberOfBlocks(rel);
+ vacrel.scan_stats = scan_stats;
+
+ /* initialize per-worker relation statistics */
+ MemSet(scan_stats, 0, sizeof(LVRelScanStats));
+
+ /* Set fields necessary for heap scan */
+ vacrel.scan_stats->NewRelfrozenXid = shared->NewRelfrozenXid;
+ vacrel.scan_stats->NewRelminMxid = shared->NewRelminMxid;
+ vacrel.scan_stats->skippedallvis = shared->skippedallvis;
+
+ /* Initialize the per-worker scan state if not yet */
+ if (!phvstate->myscanstate->initialized)
+ {
+ table_block_parallelscan_startblock_init(rel,
+ &(phvstate->myscanstate->state),
+ phvstate->pscandesc);
+
+ pg_atomic_init_u32(&(phvstate->myscanstate->cur_blkno), 0);
+ phvstate->myscanstate->maybe_have_blocks = false;
+ phvstate->myscanstate->initialized = true;
+ }
+
+ /*
+ * Setup error traceback support for ereport() for parallel table vacuum
+ * workers
+ */
+ vacrel.dbname = get_database_name(MyDatabaseId);
+ vacrel.relnamespace = get_namespace_name(RelationGetNamespace(rel));
+ vacrel.relname = pstrdup(RelationGetRelationName(rel));
+ vacrel.indname = NULL;
+ vacrel.phase = VACUUM_ERRCB_PHASE_SCAN_HEAP;
+ errcallback.callback = vacuum_error_callback;
+ errcallback.arg = &vacrel;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ scan_done = do_lazy_scan_heap(&vacrel);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+
+ /*
+ * If the leader or a worker finishes the heap scan because the space for
+ * dead_items TIDs is close to the limit, it might still have blocks
+ * allocated in its scan state. Since this scan state might not be picked
+ * up in the next heap scan, we remember that it may have unconsumed
+ * blocks so that the leader can complete the scan after the heap scan
+ * phase finishes.
+ */
+ phvstate->myscanstate->maybe_have_blocks = !scan_done;
+}
+
+/*
+ * Complete parallel heap scans that have remaining blocks in their
+ * chunks.
+ */
+static void
+parallel_heap_complete_unfinished_scan(LVRelState *vacrel)
+{
+ int nworkers;
+
+ Assert(!IsParallelWorker());
+
+ nworkers = parallel_vacuum_get_nworkers_table(vacrel->pvs);
+
+ for (int i = 0; i < nworkers; i++)
+ {
+ PHVScanWorkerState *wstate = &(vacrel->phvstate->scanstates[i]);
+ bool scan_done PG_USED_FOR_ASSERTS_ONLY;
+
+ if (!wstate->maybe_have_blocks)
+ continue;
+
+ /* Attach the worker's scan state and do the heap scan */
+ vacrel->phvstate->myscanstate = wstate;
+ scan_done = do_lazy_scan_heap(vacrel);
+
+ Assert(scan_done);
+ }
+
+ /*
+ * We don't need to gather the scan statistics here because the leader
+ * performs these scans itself, accumulating directly into its own
+ * statistics.
+ */
+}
+
+/*
+ * Compute the minimum block number we have scanned so far and update
+ * vacrel->phvstate->min_blkno.
+ */
+static void
+parallel_heap_vacuum_compute_min_blkno(LVRelState *vacrel)
+{
+ PHVState *phvstate = vacrel->phvstate;
+
+ Assert(ParallelHeapVacuumIsActive(vacrel));
+
+ /*
+ * We check all worker scan states here to compute the minimum block
+ * number among all scan states.
+ */
+ for (int i = 0; i < phvstate->nworkers_launched; i++)
+ {
+ PHVScanWorkerState *wstate = &(phvstate->scanstates[i]);
+ BlockNumber blkno;
+
+ /* Skip if the scan state was never initialized by any worker */
+ if (!wstate->initialized)
+ continue;
+
+ blkno = pg_atomic_read_u32(&(wstate->cur_blkno));
+ if (blkno < phvstate->min_blkno)
+ phvstate->min_blkno = blkno;
+ }
+}
+
+/*
+ * Accumulate relation scan_stats that parallel workers collected into the
+ * leader's counters.
+ */
+static void
+parallel_heap_vacuum_gather_scan_stats(LVRelState *vacrel)
+{
+ PHVState *phvstate = vacrel->phvstate;
+
+ Assert(ParallelHeapVacuumIsActive(vacrel));
+ Assert(!IsParallelWorker());
+
+ /* Gather the scan statistics that workers collected */
+ for (int i = 0; i < phvstate->nworkers_launched; i++)
+ {
+ LVRelScanStats *ss = &(phvstate->shared->worker_scan_stats[i]);
+
+ vacrel->scan_stats->scanned_pages += ss->scanned_pages;
+ vacrel->scan_stats->removed_pages += ss->removed_pages;
+ vacrel->scan_stats->frozen_pages += ss->frozen_pages;
+ vacrel->scan_stats->lpdead_item_pages += ss->lpdead_item_pages;
+ vacrel->scan_stats->missed_dead_pages += ss->missed_dead_pages;
+ vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages;
+ vacrel->scan_stats->tuples_deleted += ss->tuples_deleted;
+ vacrel->scan_stats->tuples_frozen += ss->tuples_frozen;
+ vacrel->scan_stats->lpdead_items += ss->lpdead_items;
+ vacrel->scan_stats->live_tuples += ss->live_tuples;
+ vacrel->scan_stats->recently_dead_tuples += ss->recently_dead_tuples;
+ vacrel->scan_stats->missed_dead_tuples += ss->missed_dead_tuples;
+
+ /* keep the maximum; pages before nonempty_pages must not be truncated */
+ if (ss->nonempty_pages > vacrel->scan_stats->nonempty_pages)
+ vacrel->scan_stats->nonempty_pages = ss->nonempty_pages;
+
+ if (TransactionIdPrecedes(ss->NewRelfrozenXid, vacrel->scan_stats->NewRelfrozenXid))
+ vacrel->scan_stats->NewRelfrozenXid = ss->NewRelfrozenXid;
+
+ if (MultiXactIdPrecedesOrEquals(ss->NewRelminMxid, vacrel->scan_stats->NewRelminMxid))
+ vacrel->scan_stats->NewRelminMxid = ss->NewRelminMxid;
+
+ if (!vacrel->scan_stats->skippedallvis && ss->skippedallvis)
+ vacrel->scan_stats->skippedallvis = true;
+ }
+
+ /* Also, compute the minimum block number we scanned so far */
+ parallel_heap_vacuum_compute_min_blkno(vacrel);
+}
+
+/*
+ * A parallel variant of do_lazy_scan_heap(). The leader process launches parallel
+ * workers to scan the heap in parallel.
+ */
+static void
+do_parallel_lazy_scan_heap(LVRelState *vacrel)
+{
+ PHVScanWorkerState *scanstate;
+
+ Assert(ParallelHeapVacuumIsActive(vacrel));
+ Assert(!IsParallelWorker());
+
+ /* launch parallel workers */
+ vacrel->phvstate->nworkers_launched = parallel_vacuum_table_scan_begin(vacrel->pvs);
+
+ /* initialize the leader's own scan state so it can join as a worker */
+ scanstate = palloc0(sizeof(PHVScanWorkerState));
+ table_block_parallelscan_startblock_init(vacrel->rel, &(scanstate->state),
+ vacrel->phvstate->pscandesc);
+ vacrel->phvstate->myscanstate = scanstate;
+
+ for (;;)
+ {
+ bool scan_done;
+
+ /*
+ * Scan the table until either we are close to overrunning the
+ * available space for dead_items TIDs or we reach the end of the
+ * table.
+ */
+ scan_done = do_lazy_scan_heap(vacrel);
+
+ /* stop parallel workers and gather the collected stats */
+ parallel_vacuum_table_scan_end(vacrel->pvs);
+ parallel_heap_vacuum_gather_scan_stats(vacrel);
+
+ /*
+ * If the heap scan paused in the middle of the table because the
+ * dead_items TID store became full, perform a round of index and heap
+ * vacuuming.
+ */
+ if (!scan_done)
+ {
+ /* Perform a round of index and heap vacuuming */
+ vacrel->consider_bypass_optimization = false;
+ lazy_vacuum(vacrel);
+
+ /*
+ * Vacuum the Free Space Map to make newly-freed space visible on
+ * upper-level FSM pages.
+ */
+ if (vacrel->phvstate->min_blkno > vacrel->next_fsm_block_to_vacuum)
+ {
+ /*
+ * min_blkno should have already been updated when gathering
+ * statistics
+ */
+ FreeSpaceMapVacuumRange(vacrel->rel, vacrel->next_fsm_block_to_vacuum,
+ vacrel->phvstate->min_blkno + 1);
+ vacrel->next_fsm_block_to_vacuum = vacrel->phvstate->min_blkno;
+ }
+
+ /* Report that we are once again scanning the heap */
+ pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+ PROGRESS_VACUUM_PHASE_SCAN_HEAP);
+
+ /* re-launch parallel workers */
+ vacrel->phvstate->nworkers_launched =
+ parallel_vacuum_table_scan_begin(vacrel->pvs);
+
+ continue;
+ }
+
+ /* We reached the end of the table */
+ break;
+ }
+
+ /*
+ * The parallel heap scan has finished, but some workers might still have
+ * allocated blocks that they have not processed yet. This can happen,
+ * for example, when workers exit because the dead_items TID store is
+ * full and the leader launches fewer workers in the next cycle.
+ */
+ parallel_heap_complete_unfinished_scan(vacrel);
+}
+
/*
* Error context callback for errors occurring during vacuum. The error
* context messages for index phases should match the messages set in parallel
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 4fd6574e129..3aea80a29c4 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -6,15 +6,24 @@
* This file contains routines that are intended to support setting up, using,
* and tearing down a ParallelVacuumState.
*
- * In a parallel vacuum, we perform both index bulk deletion and index cleanup
- * with parallel worker processes. Individual indexes are processed by one
- * vacuum process. ParallelVacuumState contains shared information as well as
- * the memory space for storing dead items allocated in the DSA area. We
- * launch parallel worker processes at the start of parallel index
- * bulk-deletion and index cleanup and once all indexes are processed, the
- * parallel worker processes exit. Each time we process indexes in parallel,
- * the parallel context is re-initialized so that the same DSM can be used for
- * multiple passes of index bulk-deletion and index cleanup.
+ * In a parallel vacuum, we perform the table scan, index bulk deletion and
+ * index cleanup, or all of them with parallel worker processes. Different
+ * numbers of workers may be launched for table vacuuming and index processing.
+ * ParallelVacuumState contains shared information as well as the memory space
+ * for storing dead items allocated in the DSA area.
+ *
+ * When initializing a parallel table vacuum scan, we invoke table AM routines for
+ * estimating DSM sizes and initializing DSM memory. Parallel table vacuum
+ * workers invoke the table AM routine for vacuuming the table.
+ *
+ * For processing indexes in parallel, individual indexes are processed by one
+ * vacuum process. We launch parallel worker processes at the start of parallel index
+ * bulk-deletion and index cleanup and once all indexes are processed, the parallel
+ * worker processes exit.
+ *
+ * Each time we process the table or indexes in parallel, the parallel context is
+ * re-initialized so that the same DSM can be used for multiple passes of table vacuum
+ * or index bulk-deletion and index cleanup.
*
* Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -28,6 +37,7 @@
#include "access/amapi.h"
#include "access/table.h"
+#include "access/tableam.h"
#include "access/xact.h"
#include "commands/progress.h"
#include "commands/vacuum.h"
@@ -65,6 +75,12 @@ typedef struct PVShared
int elevel;
uint64 queryid;
+ /*
+ * True if the caller wants parallel workers to invoke the vacuum table
+ * scan callback.
+ */
+ bool do_vacuum_table_scan;
+
/*
* Fields for both index vacuum and cleanup.
*
@@ -101,6 +117,13 @@ typedef struct PVShared
*/
pg_atomic_uint32 cost_balance;
+ /*
+ * The number of workers for parallel table scan/vacuuming and index
+ * vacuuming, respectively.
+ */
+ int nworkers_for_table;
+ int nworkers_for_index;
+
/*
* Number of active parallel workers. This is used for computing the
* minimum threshold of the vacuum cost balance before a worker sleeps for
@@ -164,6 +187,9 @@ struct ParallelVacuumState
/* NULL for worker processes */
ParallelContext *pcxt;
+ /* Passed to parallel table scan workers. NULL for leader process */
+ ParallelWorkerContext *pwcxt;
+
/* Parent Heap Relation */
Relation heaprel;
@@ -193,6 +219,9 @@ struct ParallelVacuumState
/* Points to WAL usage area in DSM */
WalUsage *wal_usage;
+ /* Number of times the parallel table vacuum scan has been performed */
+ int num_table_scans;
+
/*
* False if the index is totally unsuitable target for all parallel
* processing. For example, the index could be <
@@ -221,8 +250,9 @@ struct ParallelVacuumState
PVIndVacStatus status;
};
-static int parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested,
- bool *will_parallel_vacuum);
+static void parallel_vacuum_compute_workers(Relation rel, Relation *indrels, int nindexes,
+ int nrequested, int *nworkers_table,
+ int *nworkers_index, bool *will_parallel_vacuum);
static void parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scans,
bool vacuum);
static void parallel_vacuum_process_safe_indexes(ParallelVacuumState *pvs);
@@ -242,7 +272,7 @@ static void parallel_vacuum_error_callback(void *arg);
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
int nrequested_workers, int vac_work_mem,
- int elevel, BufferAccessStrategy bstrategy)
+ int elevel, BufferAccessStrategy bstrategy, void *state)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
@@ -256,6 +286,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
Size est_shared_len;
int nindexes_mwm = 0;
int parallel_workers = 0;
+ int nworkers_table;
+ int nworkers_index;
int querylen;
/*
@@ -263,15 +295,17 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
* relation
*/
Assert(nrequested_workers >= 0);
- Assert(nindexes > 0);
/*
* Compute the number of parallel vacuum workers to launch
*/
will_parallel_vacuum = (bool *) palloc0(sizeof(bool) * nindexes);
- parallel_workers = parallel_vacuum_compute_workers(indrels, nindexes,
- nrequested_workers,
- will_parallel_vacuum);
+ parallel_vacuum_compute_workers(rel, indrels, nindexes, nrequested_workers,
+ &nworkers_table, &nworkers_index,
+ will_parallel_vacuum);
+
+ parallel_workers = Max(nworkers_table, nworkers_index);
+
if (parallel_workers <= 0)
{
/* Can't perform vacuum in parallel -- return NULL */
@@ -327,6 +361,10 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
else
querylen = 0; /* keep compiler quiet */
+ /* Estimate AM-specific space for parallel table vacuum */
+ if (nworkers_table > 0)
+ table_parallel_vacuum_estimate(rel, pcxt, nworkers_table, state);
+
InitializeParallelDSM(pcxt);
/* Prepare index vacuum stats */
@@ -371,6 +409,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shared->relid = RelationGetRelid(rel);
shared->elevel = elevel;
shared->queryid = pgstat_get_my_query_id();
+ shared->nworkers_for_table = nworkers_table;
+ shared->nworkers_for_index = nworkers_index;
shared->maintenance_work_mem_worker =
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
@@ -419,6 +459,10 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
PARALLEL_VACUUM_KEY_QUERY_TEXT, sharedquery);
}
+ /* Prepare AM-specific DSM for parallel table vacuum */
+ if (nworkers_table > 0)
+ table_parallel_vacuum_initialize(rel, pcxt, nworkers_table, state);
+
/* Success -- return parallel vacuum state */
return pvs;
}
@@ -534,33 +578,47 @@ parallel_vacuum_cleanup_all_indexes(ParallelVacuumState *pvs, long num_table_tup
}
/*
- * Compute the number of parallel worker processes to request. Both index
- * vacuum and index cleanup can be executed with parallel workers.
- * The index is eligible for parallel vacuum iff its size is greater than
- * min_parallel_index_scan_size as invoking workers for very small indexes
- * can hurt performance.
+ * Compute the number of parallel worker processes to request for table
+ * vacuum and index vacuum/cleanup.
+ *
+ * For parallel table vacuum, we ask the table AM routine to compute the
+ * number of parallel worker processes. The result is set in *nworkers_table.
*
- * nrequested is the number of parallel workers that user requested. If
- * nrequested is 0, we compute the parallel degree based on nindexes, that is
- * the number of indexes that support parallel vacuum. This function also
- * sets will_parallel_vacuum to remember indexes that participate in parallel
- * vacuum.
+ * For parallel index vacuum, an index is eligible for parallel vacuum iff
+ * its size is greater than min_parallel_index_scan_size, as invoking workers
+ * for very small indexes can hurt performance. nrequested is the number of
+ * parallel workers that the user requested. If nrequested is 0, we compute the
+ * parallel degree based on nindexes, that is the number of indexes that
+ * support parallel vacuum. This function also sets will_parallel_vacuum to
+ * remember indexes that participate in parallel vacuum.
*/
-static int
-parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested,
- bool *will_parallel_vacuum)
+static void
+parallel_vacuum_compute_workers(Relation rel, Relation *indrels, int nindexes,
+ int nrequested, int *nworkers_table,
+ int *nworkers_index, bool *will_parallel_vacuum)
{
int nindexes_parallel = 0;
int nindexes_parallel_bulkdel = 0;
int nindexes_parallel_cleanup = 0;
- int parallel_workers;
+ int parallel_workers_table = 0;
+ int parallel_workers_index = 0;
+
+ *nworkers_table = 0;
+ *nworkers_index = 0;
/*
* We don't allow performing parallel operation in standalone backend or
* when parallelism is disabled.
*/
if (!IsUnderPostmaster || max_parallel_maintenance_workers == 0)
- return 0;
+ return;
+
+ /*
+ * Compute the number of workers for parallel table scan. Cap by
+ * max_parallel_maintenance_workers.
+ */
+ parallel_workers_table = Min(table_parallel_vacuum_compute_workers(rel, nrequested),
+ max_parallel_maintenance_workers);
/*
* Compute the number of indexes that can participate in parallel vacuum.
@@ -591,17 +649,18 @@ parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested,
nindexes_parallel--;
/* No index supports parallel vacuum */
- if (nindexes_parallel <= 0)
- return 0;
-
- /* Compute the parallel degree */
- parallel_workers = (nrequested > 0) ?
- Min(nrequested, nindexes_parallel) : nindexes_parallel;
+ if (nindexes_parallel > 0)
+ {
+ /* Compute the parallel degree for parallel index vacuum */
+ parallel_workers_index = (nrequested > 0) ?
+ Min(nrequested, nindexes_parallel) : nindexes_parallel;
- /* Cap by max_parallel_maintenance_workers */
- parallel_workers = Min(parallel_workers, max_parallel_maintenance_workers);
+ /* Cap by max_parallel_maintenance_workers */
+ parallel_workers_index = Min(parallel_workers_index, max_parallel_maintenance_workers);
+ }
- return parallel_workers;
+ *nworkers_table = parallel_workers_table;
+ *nworkers_index = parallel_workers_index;
}
/*
@@ -671,7 +730,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
if (nworkers > 0)
{
/* Reinitialize parallel context to relaunch parallel workers */
- if (num_index_scans > 0)
+ if (num_index_scans > 0 || pvs->num_table_scans > 0)
ReinitializeParallelDSM(pvs->pcxt);
/*
@@ -980,6 +1039,139 @@ parallel_vacuum_index_is_parallel_safe(Relation indrel, int num_index_scans,
return true;
}
+/*
+ * Prepare the shared vacuum cost-delay state and launch parallel workers
+ * for parallel table vacuum. Return the number of parallel workers launched.
+ *
+ * The caller must call parallel_vacuum_table_scan_end() to finish the parallel
+ * table vacuum.
+ */
+int
+parallel_vacuum_table_scan_begin(ParallelVacuumState *pvs)
+{
+ Assert(!IsParallelWorker());
+
+ if (pvs->shared->nworkers_for_table == 0)
+ return 0;
+
+ pg_atomic_write_u32(&(pvs->shared->cost_balance), VacuumCostBalance);
+ pg_atomic_write_u32(&(pvs->shared->active_nworkers), 0);
+
+ pvs->shared->do_vacuum_table_scan = true;
+
+ if (pvs->num_table_scans > 0)
+ ReinitializeParallelDSM(pvs->pcxt);
+
+ /*
+ * The number of workers might vary between table vacuum and index
+ * processing
+ */
+ ReinitializeParallelWorkers(pvs->pcxt, pvs->shared->nworkers_for_table);
+ LaunchParallelWorkers(pvs->pcxt);
+
+ if (pvs->pcxt->nworkers_launched > 0)
+ {
+ /*
+ * Reset the local cost values for leader backend as we have already
+ * accumulated the remaining balance of heap.
+ */
+ VacuumCostBalance = 0;
+ VacuumCostBalanceLocal = 0;
+
+ /* Enable shared cost balance for leader backend */
+ VacuumSharedCostBalance = &(pvs->shared->cost_balance);
+ VacuumActiveNWorkers = &(pvs->shared->active_nworkers);
+
+ /* Include the worker count for the leader itself */
+ pg_atomic_add_fetch_u32(VacuumActiveNWorkers, 1);
+ }
+
+ ereport(pvs->shared->elevel,
+ (errmsg(ngettext("launched %d parallel vacuum worker for table processing (planned: %d)",
+ "launched %d parallel vacuum workers for table processing (planned: %d)",
+ pvs->pcxt->nworkers_launched),
+ pvs->pcxt->nworkers_launched, pvs->shared->nworkers_for_table)));
+
+ return pvs->pcxt->nworkers_launched;
+}
+
+/*
+ * Wait for all workers of the parallel table vacuum scan to finish, and
+ * accumulate their buffer and WAL usage.
+ */
+void
+parallel_vacuum_table_scan_end(ParallelVacuumState *pvs)
+{
+ Assert(!IsParallelWorker());
+
+ if (pvs->shared->nworkers_for_table == 0)
+ return;
+
+ WaitForParallelWorkersToFinish(pvs->pcxt);
+
+ /* Decrement the worker count for the leader itself */
+ pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1);
+
+ for (int i = 0; i < pvs->pcxt->nworkers_launched; i++)
+ InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]);
+
+ /*
+ * Carry the shared balance value to heap scan and disable shared costing
+ */
+ if (VacuumSharedCostBalance)
+ {
+ VacuumCostBalance = pg_atomic_read_u32(VacuumSharedCostBalance);
+ VacuumSharedCostBalance = NULL;
+ VacuumActiveNWorkers = NULL;
+ }
+
+ pvs->shared->do_vacuum_table_scan = false;
+ pvs->num_table_scans++;
+}
+
+/* Return the array of indexes associated with the table being vacuumed */
+Relation *
+parallel_vacuum_get_table_indexes(ParallelVacuumState *pvs, int *nindexes)
+{
+ *nindexes = pvs->nindexes;
+
+ return pvs->indrels;
+}
+
+/* Return the number of workers for parallel table vacuum */
+int
+parallel_vacuum_get_nworkers_table(ParallelVacuumState *pvs)
+{
+ return pvs->shared->nworkers_for_table;
+}
+
+/* Return the number of workers for parallel index processing */
+int
+parallel_vacuum_get_nworkers_index(ParallelVacuumState *pvs)
+{
+ return pvs->shared->nworkers_for_index;
+}
+
+/*
+ * A parallel worker invokes the table-AM-specific vacuum scan callback.
+ */
+static void
+parallel_vacuum_process_table(ParallelVacuumState *pvs)
+{
+ Assert(VacuumActiveNWorkers);
+
+ /* Increment the active worker count before starting the table vacuum */
+ pg_atomic_add_fetch_u32(VacuumActiveNWorkers, 1);
+
+ /* Do table vacuum scan */
+ table_parallel_vacuum_scan(pvs->heaprel, pvs, pvs->pwcxt);
+
+ /*
+ * We have completed the table vacuum so decrement the active worker
+ * count.
+ */
+ pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1);
+}
+
/*
* Perform work within a launched parallel process.
*
@@ -999,7 +1191,6 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
WalUsage *wal_usage;
int nindexes;
char *sharedquery;
- ErrorContextCallback errcallback;
/*
* A parallel vacuum worker must have only PROC_IN_VACUUM flag since we
@@ -1031,7 +1222,6 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
* matched to the leader's one.
*/
vac_open_indexes(rel, RowExclusiveLock, &nindexes, &indrels);
- Assert(nindexes > 0);
if (shared->maintenance_work_mem_worker > 0)
maintenance_work_mem = shared->maintenance_work_mem_worker;
@@ -1062,6 +1252,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
pvs.relname = pstrdup(RelationGetRelationName(rel));
pvs.heaprel = rel;
+ pvs.pwcxt = palloc(sizeof(ParallelWorkerContext));
+ pvs.pwcxt->toc = toc;
+ pvs.pwcxt->seg = seg;
+
/* These fields will be filled during index vacuum or cleanup */
pvs.indname = NULL;
pvs.status = PARALLEL_INDVAC_STATUS_INITIAL;
@@ -1070,17 +1264,29 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
pvs.bstrategy = GetAccessStrategyWithSize(BAS_VACUUM,
shared->ring_nbuffers * (BLCKSZ / 1024));
- /* Setup error traceback support for ereport() */
- errcallback.callback = parallel_vacuum_error_callback;
- errcallback.arg = &pvs;
- errcallback.previous = error_context_stack;
- error_context_stack = &errcallback;
-
/* Prepare to track buffer usage during parallel execution */
InstrStartParallelQuery();
- /* Process indexes to perform vacuum/cleanup */
- parallel_vacuum_process_safe_indexes(&pvs);
+ if (pvs.shared->do_vacuum_table_scan)
+ {
+ parallel_vacuum_process_table(&pvs);
+ }
+ else
+ {
+ ErrorContextCallback errcallback;
+
+ /* Setup error traceback support for ereport() */
+ errcallback.callback = parallel_vacuum_error_callback;
+ errcallback.arg = &pvs;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* Process indexes to perform vacuum/cleanup */
+ parallel_vacuum_process_safe_indexes(&pvs);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+ }
/* Report buffer/WAL usage during parallel execution */
buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
@@ -1090,9 +1296,6 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
TidStoreDetach(dead_items);
- /* Pop the error context stack */
- error_context_stack = errcallback.previous;
-
vac_close_indexes(nindexes, indrels, RowExclusiveLock);
table_close(rel, ShareUpdateExclusiveLock);
FreeAccessStrategy(pvs.bstrategy);
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 36610a1c7e7..5b2b08a844c 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -164,15 +164,6 @@ typedef struct ProcArrayStruct
*
* The typedef is in the header.
*/
-struct GlobalVisState
-{
- /* XIDs >= are considered running by some backend */
- FullTransactionId definitely_needed;
-
- /* XIDs < are not considered to be running by any backend */
- FullTransactionId maybe_needed;
-};
-
/*
* Result of ComputeXidHorizons().
*/
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 96cf82f97b7..427a2f97105 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -21,6 +21,7 @@
#include "access/skey.h"
#include "access/table.h" /* for backward compatibility */
#include "access/tableam.h"
+#include "commands/vacuum.h"
#include "nodes/lockoptions.h"
#include "nodes/primnodes.h"
#include "storage/bufpage.h"
@@ -401,6 +402,13 @@ extern void log_heap_prune_and_freeze(Relation relation, Buffer buffer,
struct VacuumParams;
extern void heap_vacuum_rel(Relation rel,
struct VacuumParams *params, BufferAccessStrategy bstrategy);
+extern int heap_parallel_vacuum_compute_workers(Relation rel, int requested);
+extern void heap_parallel_vacuum_estimate(Relation rel, ParallelContext *pcxt,
+ int nworkers, void *state);
+extern void heap_parallel_vacuum_initialize(Relation rel, ParallelContext *pcxt,
+ int nworkers, void *state);
+extern void heap_parallel_vacuum_scan_worker(Relation rel, ParallelVacuumState *pvs,
+ ParallelWorkerContext *pwcxt);
/* in heap/heapam_visibility.c */
extern bool HeapTupleSatisfiesVisibility(HeapTuple htup, Snapshot snapshot,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index adb478a93ca..26e36d90790 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -20,6 +20,7 @@
#include "access/relscan.h"
#include "access/sdir.h"
#include "access/xact.h"
+#include "commands/vacuum.h"
#include "executor/tuptable.h"
#include "storage/read_stream.h"
#include "utils/rel.h"
@@ -654,6 +655,46 @@ typedef struct TableAmRoutine
struct VacuumParams *params,
BufferAccessStrategy bstrategy);
+ /* ------------------------------------------------------------------------
+ * Callbacks for parallel table vacuum.
+ * ------------------------------------------------------------------------
+ */
+
+ /*
+ * Compute the number of parallel workers for parallel table vacuum. The
+ * function must return 0 to disable parallel table vacuum.
+ */
+ int (*parallel_vacuum_compute_workers) (Relation rel, int requested);
+
+ /*
+ * Compute the amount of DSM space the AM needs for parallel table vacuum.
+ *
+ * Not called if parallel table vacuum is disabled.
+ */
+ void (*parallel_vacuum_estimate) (Relation rel,
+ ParallelContext *pcxt,
+ int nworkers,
+ void *state);
+
+ /*
+ * Initialize DSM space for parallel table vacuum.
+ *
+ * Not called if parallel table vacuum is disabled.
+ */
+ void (*parallel_vacuum_initialize) (Relation rel,
+ ParallelContext *pctx,
+ int nworkers,
+ void *state);
+
+ /*
+ * This callback is called in parallel table vacuum worker processes.
+ *
+ * Not called if parallel table vacuum is disabled.
+ */
+ void (*parallel_vacuum_scan_worker) (Relation rel,
+ ParallelVacuumState *pvs,
+ ParallelWorkerContext *pwcxt);
+
/*
* Prepare to analyze block `blockno` of `scan`. The scan has been started
* with table_beginscan_analyze(). See also
@@ -1719,6 +1760,52 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
rel->rd_tableam->relation_vacuum(rel, params, bstrategy);
}
+/* ----------------------------------------------------------------------------
+ * Parallel vacuum related functions.
+ * ----------------------------------------------------------------------------
+ */
+
+/*
+ * Return the number of parallel workers for a parallel vacuum scan of this
+ * relation.
+ */
+static inline int
+table_parallel_vacuum_compute_workers(Relation rel, int requested)
+{
+ return rel->rd_tableam->parallel_vacuum_compute_workers(rel, requested);
+}
+
+/*
+ * Estimate the size of shared memory needed for a parallel vacuum scan of
+ * this relation.
+ */
+static inline void
+table_parallel_vacuum_estimate(Relation rel, ParallelContext *pcxt, int nworkers,
+ void *state)
+{
+ rel->rd_tableam->parallel_vacuum_estimate(rel, pcxt, nworkers, state);
+}
+
+/*
+ * Initialize shared memory area for a parallel vacuum scan of this relation.
+ */
+static inline void
+table_parallel_vacuum_initialize(Relation rel, ParallelContext *pcxt, int nworkers,
+ void *state)
+{
+ rel->rd_tableam->parallel_vacuum_initialize(rel, pcxt, nworkers, state);
+}
+
+/*
+ * Start a parallel vacuum scan of this relation.
+ */
+static inline void
+table_parallel_vacuum_scan(Relation rel, ParallelVacuumState *pvs,
+ ParallelWorkerContext *pwcxt)
+{
+ rel->rd_tableam->parallel_vacuum_scan_worker(rel, pvs, pwcxt);
+}
+
/*
* Prepare to analyze the next block in the read stream. The scan needs to
* have been started with table_beginscan_analyze(). Note that this routine
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 759f9a87d38..a225f314290 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -360,7 +360,8 @@ extern void VacuumUpdateCosts(void);
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
int vac_work_mem, int elevel,
- BufferAccessStrategy bstrategy);
+ BufferAccessStrategy bstrategy,
+ void *state);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs,
VacDeadItemsInfo **dead_items_info_p);
@@ -372,6 +373,11 @@ extern void parallel_vacuum_cleanup_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans,
bool estimated_count);
+extern int parallel_vacuum_table_scan_begin(ParallelVacuumState *pvs);
+extern void parallel_vacuum_table_scan_end(ParallelVacuumState *pvs);
+extern int parallel_vacuum_get_nworkers_table(ParallelVacuumState *pvs);
+extern int parallel_vacuum_get_nworkers_index(ParallelVacuumState *pvs);
+extern Relation *parallel_vacuum_get_table_indexes(ParallelVacuumState *pvs, int *nindexes);
extern void parallel_vacuum_main(dsm_segment *seg, shm_toc *toc);
/* in commands/analyze.c */
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 9398a84051c..6ccb19a29ff 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -102,8 +102,20 @@ extern char *ExportSnapshot(Snapshot snapshot);
/*
* These live in procarray.c because they're intimately linked to the
* procarray contents, but thematically they better fit into snapmgr.h.
+ *
+ * XXX the struct definition is temporarily moved from procarray.c for
+ * parallel table vacuum development. We need to find a suitable way for
+ * parallel table vacuum workers to share the GlobalVisState.
*/
-typedef struct GlobalVisState GlobalVisState;
+typedef struct GlobalVisState
+{
+ /* XIDs >= are considered running by some backend */
+ FullTransactionId definitely_needed;
+
+ /* XIDs < are not considered to be running by any backend */
+ FullTransactionId maybe_needed;
+} GlobalVisState;
+
extern GlobalVisState *GlobalVisTestFor(Relation rel);
extern bool GlobalVisTestIsRemovableXid(GlobalVisState *state, TransactionId xid);
extern bool GlobalVisTestIsRemovableFullXid(GlobalVisState *state, FullTransactionId fxid);
--
2.43.5
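
For reference, the new table AM callbacks are expected to be wired up in
heapam's TableAmRoutine (heapam_handler.c) roughly as follows. This is a
sketch based on the declarations above, with the existing members elided;
the actual heapam_handler.c hunk is not shown in this excerpt:

    static const TableAmRoutine heapam_methods = {
        /* ... existing callbacks elided ... */
        .relation_vacuum = heap_vacuum_rel,

        /* new callbacks for parallel table vacuum */
        .parallel_vacuum_compute_workers = heap_parallel_vacuum_compute_workers,
        .parallel_vacuum_estimate = heap_parallel_vacuum_estimate,
        .parallel_vacuum_initialize = heap_parallel_vacuum_initialize,
        .parallel_vacuum_scan_worker = heap_parallel_vacuum_scan_worker,
    };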
Attachment: v4-0003-Support-shared-itereation-on-TidStore.patch (application/octet-stream)
From 8d802ea873622abc43265615b8e6537da70987b7 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 24 Oct 2024 17:34:57 -0700
Subject: [PATCH v4 3/4] Support shared iteration on TidStore.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
src/backend/access/common/tidstore.c | 59 ++++++++++++++++++++++++++++
src/include/access/tidstore.h | 3 ++
2 files changed, 62 insertions(+)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index a7179759d67..637d26012d2 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -483,6 +483,7 @@ TidStoreBeginIterate(TidStore *ts)
iter = palloc0(sizeof(TidStoreIter));
iter->ts = ts;
+ /* begin iteration on the radix tree */
if (TidStoreIsShared(ts))
iter->tree_iter.shared = shared_ts_begin_iterate(ts->tree.shared);
else
@@ -533,6 +534,56 @@ TidStoreEndIterate(TidStoreIter *iter)
pfree(iter);
}
+/*
+ * Prepare to iterate through a shared TidStore in shared mode. This function
+ * is aimed to start the iteration on the given TidStore with parallel workers.
+ *
+ * The TidStoreIter struct is created in the caller's memory context, and it
+ * will be freed in TidStoreEndIterate.
+ *
+ * The caller is responsible for locking TidStore until the iteration is
+ * finished.
+ */
+TidStoreIter *
+TidStoreBeginIterateShared(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ if (!TidStoreIsShared(ts))
+ elog(ERROR, "cannot begin shared iteration on local TidStore");
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+
+ /* begin the shared iteration on radix tree */
+ iter->tree_iter.shared =
+ (shared_ts_iter *) shared_ts_begin_iterate_shared(ts->tree.shared);
+
+ return iter;
+}
+
+/*
+ * Attach to the shared TidStore iterator. 'iter_handle' is the dsa_pointer
+ * returned by TidStoreGetSharedIterHandle(). The returned object is allocated
+ * in backend-local memory using CurrentMemoryContext.
+ */
+TidStoreIter *
+TidStoreAttachIterateShared(TidStore *ts, dsa_pointer iter_handle)
+{
+ TidStoreIter *iter;
+
+ Assert(TidStoreIsShared(ts));
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+
+ /* Attach to the shared iterator */
+ iter->tree_iter.shared = shared_ts_attach_iterate_shared(ts->tree.shared,
+ iter_handle);
+
+ return iter;
+}
+
/*
* Return the memory usage of TidStore.
*/
@@ -564,6 +615,14 @@ TidStoreGetHandle(TidStore *ts)
return (dsa_pointer) shared_ts_get_handle(ts->tree.shared);
}
+dsa_pointer
+TidStoreGetSharedIterHandle(TidStoreIter *iter)
+{
+ Assert(TidStoreIsShared(iter->ts));
+
+ return (dsa_pointer) shared_ts_get_iter_handle(iter->tree_iter.shared);
+}
+
/*
* Given a TidStoreIterResult returned by TidStoreIterateNext(), extract the
* offset numbers. Returns the number of offsets filled in, if <=
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
index aeaf563b6a9..f20c9a92e55 100644
--- a/src/include/access/tidstore.h
+++ b/src/include/access/tidstore.h
@@ -37,6 +37,9 @@ extern void TidStoreDetach(TidStore *ts);
extern void TidStoreLockExclusive(TidStore *ts);
extern void TidStoreLockShare(TidStore *ts);
extern void TidStoreUnlock(TidStore *ts);
+extern TidStoreIter *TidStoreBeginIterateShared(TidStore *ts);
+extern TidStoreIter *TidStoreAttachIterateShared(TidStore *ts, dsa_pointer iter_handle);
+extern dsa_pointer TidStoreGetSharedIterHandle(TidStoreIter *iter);
extern void TidStoreDestroy(TidStore *ts);
extern void TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
int num_offsets);
--
2.43.5
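
To illustrate the intended use of the shared iteration API (a sketch with
error handling elided; per the comments above, the caller is responsible
for locking the TidStore until the iteration finishes):

    /* leader: start the shared iteration and publish its handle */
    TidStoreIter *iter = TidStoreBeginIterateShared(ts);
    dsa_pointer handle = TidStoreGetSharedIterHandle(iter);
    /* ... stash "handle" in the DSM segment for workers to find ... */

    /* worker: attach to the shared iterator and consume blocks */
    TidStoreIter *it = TidStoreAttachIterateShared(ts, handle);
    TidStoreIterResult *res;

    while ((res = TidStoreIterateNext(it)) != NULL)
    {
        /* each block is handed out to exactly one attached process */
    }
    TidStoreEndIterate(it);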
Attachment: v4-0002-raidxtree.h-support-shared-iteration.patch (application/octet-stream)
From 57e745ab91adbc41b08ae821f1fd5e5e2024349e Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 24 Oct 2024 17:29:51 -0700
Subject: [PATCH v4 2/4] radixtree.h: support shared iteration.
This commit supports a shared iteration operation on a radix tree with
multiple processes. The radix tree must be in shared mode to start a
shared iteration. Parallel workers can attach to the shared iteration
using the iterator handle given by the leader process. As with normal
iteration, the shared iteration is guaranteed to return key-values in
ascending order.
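
For example, with a template instance generated with RT_PREFIX shared_ts
(as tidstore.c does), the expected usage is roughly:

    /* leader */
    it = shared_ts_begin_iterate_shared(tree);
    handle = shared_ts_get_iter_handle(it);
    /* pass handle to workers via shared memory */

    /* worker */
    it = shared_ts_attach_iterate_shared(tree, handle);
    while ((val = shared_ts_iterate_next(it, &key)) != NULL)
    {
        /* each key-value is returned to exactly one attached process,
         * in ascending key order across all of them */
    }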
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
---
src/include/lib/radixtree.h | 221 +++++++++++++++++++++++++++++++-----
1 file changed, 190 insertions(+), 31 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 88bf695e3f3..bd5b8eed1bf 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -177,6 +177,9 @@
#define RT_ATTACH RT_MAKE_NAME(attach)
#define RT_DETACH RT_MAKE_NAME(detach)
#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#define RT_BEGIN_ITERATE_SHARED RT_MAKE_NAME(begin_iterate_shared)
+#define RT_ATTACH_ITERATE_SHARED RT_MAKE_NAME(attach_iterate_shared)
+#define RT_GET_ITER_HANDLE RT_MAKE_NAME(get_iter_handle)
#define RT_LOCK_EXCLUSIVE RT_MAKE_NAME(lock_exclusive)
#define RT_LOCK_SHARE RT_MAKE_NAME(lock_share)
#define RT_UNLOCK RT_MAKE_NAME(unlock)
@@ -236,15 +239,19 @@
#define RT_SHRINK_NODE_16 RT_MAKE_NAME(shrink_child_16)
#define RT_SHRINK_NODE_48 RT_MAKE_NAME(shrink_child_48)
#define RT_SHRINK_NODE_256 RT_MAKE_NAME(shrink_child_256)
+#define RT_INITIALIZE_ITER RT_MAKE_NAME(initialize_iter)
#define RT_NODE_ITERATE_NEXT RT_MAKE_NAME(node_iterate_next)
#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
/* type declarations */
#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER_CONTROL RT_MAKE_NAME(iter_control)
#define RT_ITER RT_MAKE_NAME(iter)
#ifdef RT_SHMEM
#define RT_HANDLE RT_MAKE_NAME(handle)
+#define RT_ITER_CONTROL_SHARED RT_MAKE_NAME(iter_control_shared)
+#define RT_ITER_HANDLE RT_MAKE_NAME(iter_handle)
#endif
#define RT_NODE RT_MAKE_NAME(node)
#define RT_CHILD_PTR RT_MAKE_NAME(child_ptr)
@@ -270,6 +277,7 @@ typedef struct RT_ITER RT_ITER;
#ifdef RT_SHMEM
typedef dsa_pointer RT_HANDLE;
+typedef dsa_pointer RT_ITER_HANDLE;
#endif
#ifdef RT_SHMEM
@@ -687,6 +695,7 @@ typedef struct RT_RADIX_TREE_CONTROL
RT_HANDLE handle;
uint32 magic;
LWLock lock;
+ int tranche_id;
#endif
RT_PTR_ALLOC root;
@@ -740,11 +749,9 @@ typedef struct RT_NODE_ITER
int idx;
} RT_NODE_ITER;
-/* state for iterating over the whole radix tree */
-struct RT_ITER
+/* Contains the iteration state data */
+typedef struct RT_ITER_CONTROL
{
- RT_RADIX_TREE *tree;
-
/*
* A stack to track iteration for each level. Level 0 is the lowest (or
* leaf) level
@@ -755,8 +762,36 @@ struct RT_ITER
/* The key constructed during iteration */
uint64 key;
-};
+} RT_ITER_CONTROL;
+
+#ifdef RT_SHMEM
+/* Contains the shared iteration state data */
+typedef struct RT_ITER_CONTROL_SHARED
+{
+ /* Actual shared iteration state data */
+ RT_ITER_CONTROL common;
+
+ /* protect the control data */
+ LWLock lock;
+
+ RT_ITER_HANDLE handle;
+ pg_atomic_uint32 refcnt;
+} RT_ITER_CONTROL_SHARED;
+#endif
+
+/* state for iterating over the whole radix tree */
+struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+ /* pointing to either local memory or DSA */
+ RT_ITER_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ /* True if the iterator is for shared iteration */
+ bool shared;
+#endif
+};
/* verification (available only in assert-enabled builds) */
static void RT_VERIFY_NODE(RT_NODE * node);
@@ -1848,6 +1883,7 @@ RT_CREATE(MemoryContext ctx)
tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
tree->ctl->handle = dp;
tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+ tree->ctl->tranche_id = tranche_id;
LWLockInitialize(&tree->ctl->lock, tranche_id);
#else
tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
@@ -1900,6 +1936,9 @@ RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
dsa_pointer control;
tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->iter_context = AllocSetContextCreate(CurrentMemoryContext,
+ RT_STR(RT_PREFIX) "_radix_tree iter context",
+ ALLOCSET_SMALL_SIZES);
/* Find the control object in shared memory */
control = handle;
@@ -2072,35 +2111,86 @@ RT_FREE(RT_RADIX_TREE * tree)
/***************** ITERATION *****************/
+/* Common routine to initialize the given iterator */
+static void
+RT_INITIALIZE_ITER(RT_RADIX_TREE * tree, RT_ITER * iter)
+{
+ RT_CHILD_PTR root;
+
+ iter->tree = tree;
+
+ Assert(RT_PTR_ALLOC_IS_VALID(tree->ctl->root));
+ root.alloc = iter->tree->ctl->root;
+ RT_PTR_SET_LOCAL(tree, &root);
+
+ iter->ctl->top_level = iter->tree->ctl->start_shift / RT_SPAN;
+
+ /* Set the root to start */
+ iter->ctl->cur_level = iter->ctl->top_level;
+ iter->ctl->node_iters[iter->ctl->cur_level].node = root;
+ iter->ctl->node_iters[iter->ctl->cur_level].idx = 0;
+}
+
/*
* Create and return the iterator for the given radix tree.
*
- * Taking a lock in shared mode during the iteration is the caller's
- * responsibility.
+ * Taking a lock on a radix tree in shared mode during the iteration is the
+ * caller's responsibility.
*/
RT_SCOPE RT_ITER *
RT_BEGIN_ITERATE(RT_RADIX_TREE * tree)
{
RT_ITER *iter;
- RT_CHILD_PTR root;
iter = (RT_ITER *) MemoryContextAllocZero(tree->iter_context,
sizeof(RT_ITER));
- iter->tree = tree;
+ iter->ctl = (RT_ITER_CONTROL *) MemoryContextAllocZero(tree->iter_context,
+ sizeof(RT_ITER_CONTROL));
- Assert(RT_PTR_ALLOC_IS_VALID(tree->ctl->root));
- root.alloc = iter->tree->ctl->root;
- RT_PTR_SET_LOCAL(tree, &root);
+ RT_INITIALIZE_ITER(tree, iter);
- iter->top_level = iter->tree->ctl->start_shift / RT_SPAN;
+#ifdef RT_SHMEM
+ /* this is a non-shared iteration, even if the tree itself is shared */
+ iter->shared = false;
+#endif
- /* Set the root to start */
- iter->cur_level = iter->top_level;
- iter->node_iters[iter->cur_level].node = root;
- iter->node_iters[iter->cur_level].idx = 0;
+ return iter;
+}
+
+#ifdef RT_SHMEM
+/*
+ * Create and return the shared iterator for the given shared radix tree.
+ *
+ * Taking a lock on a radix tree in shared mode during the shared iteration to
+ * prevent concurrent writes is the caller's responsibility.
+ */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE_SHARED(RT_RADIX_TREE * tree)
+{
+ RT_ITER *iter;
+ RT_ITER_CONTROL_SHARED *ctl_shared;
+ dsa_pointer dp;
+
+ /* The radix tree must be in shared mode */
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ dp = dsa_allocate0(tree->dsa, sizeof(RT_ITER_CONTROL_SHARED));
+ ctl_shared = (RT_ITER_CONTROL_SHARED *) dsa_get_address(tree->dsa, dp);
+ ctl_shared->handle = dp;
+ LWLockInitialize(&ctl_shared->lock, tree->ctl->tranche_id);
+ pg_atomic_init_u32(&ctl_shared->refcnt, 1);
+
+ iter = (RT_ITER *) MemoryContextAllocZero(tree->iter_context,
+ sizeof(RT_ITER));
+
+ iter->ctl = (RT_ITER_CONTROL *) ctl_shared;
+ iter->shared = true;
+
+ RT_INITIALIZE_ITER(tree, iter);
return iter;
}
+#endif
/*
* Scan the inner node and return the next child pointer if one exists, otherwise
@@ -2114,12 +2204,18 @@ RT_NODE_ITERATE_NEXT(RT_ITER * iter, int level)
RT_CHILD_PTR node;
RT_PTR_ALLOC *slot = NULL;
+ node_iter = &(iter->ctl->node_iters[level]);
+ node = node_iter->node;
+
#ifdef RT_SHMEM
- Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
-#endif
- node_iter = &(iter->node_iters[level]);
- node = node_iter->node;
+ /*
+ * Since the iterator is shared, the node's local pointer might have been
+ * set by another backend, so make sure to use a pointer local to this backend.
+ */
+ if (iter->shared)
+ RT_PTR_SET_LOCAL(iter->tree, &node);
+#endif
Assert(node.local != NULL);
@@ -2192,8 +2288,8 @@ RT_NODE_ITERATE_NEXT(RT_ITER * iter, int level)
}
/* Update the key */
- iter->key &= ~(((uint64) RT_CHUNK_MASK) << (level * RT_SPAN));
- iter->key |= (((uint64) key_chunk) << (level * RT_SPAN));
+ iter->ctl->key &= ~(((uint64) RT_CHUNK_MASK) << (level * RT_SPAN));
+ iter->ctl->key |= (((uint64) key_chunk) << (level * RT_SPAN));
return slot;
}
@@ -2207,18 +2303,29 @@ RT_ITERATE_NEXT(RT_ITER * iter, uint64 *key_p)
{
RT_PTR_ALLOC *slot = NULL;
- while (iter->cur_level <= iter->top_level)
+#ifdef RT_SHMEM
+ /* Prevent the shared iterator from being updated concurrently */
+ if (iter->shared)
+ LWLockAcquire(&((RT_ITER_CONTROL_SHARED *) iter->ctl)->lock, LW_EXCLUSIVE);
+#endif
+
+ while (iter->ctl->cur_level <= iter->ctl->top_level)
{
RT_CHILD_PTR node;
- slot = RT_NODE_ITERATE_NEXT(iter, iter->cur_level);
+ slot = RT_NODE_ITERATE_NEXT(iter, iter->ctl->cur_level);
- if (iter->cur_level == 0 && slot != NULL)
+ if (iter->ctl->cur_level == 0 && slot != NULL)
{
/* Found a value at the leaf node */
- *key_p = iter->key;
+ *key_p = iter->ctl->key;
node.alloc = *slot;
+#ifdef RT_SHMEM
+ if (iter->shared)
+ LWLockRelease(&((RT_ITER_CONTROL_SHARED *) iter->ctl)->lock);
+#endif
+
if (RT_CHILDPTR_IS_VALUE(*slot))
return (RT_VALUE_TYPE *) slot;
else
@@ -2234,17 +2341,23 @@ RT_ITERATE_NEXT(RT_ITER * iter, uint64 *key_p)
node.alloc = *slot;
RT_PTR_SET_LOCAL(iter->tree, &node);
- iter->cur_level--;
- iter->node_iters[iter->cur_level].node = node;
- iter->node_iters[iter->cur_level].idx = 0;
+ iter->ctl->cur_level--;
+ iter->ctl->node_iters[iter->ctl->cur_level].node = node;
+ iter->ctl->node_iters[iter->ctl->cur_level].idx = 0;
}
else
{
/* Not found the child slot, move up the tree */
- iter->cur_level++;
+ iter->ctl->cur_level++;
}
+
}
+#ifdef RT_SHMEM
+ if (iter->shared)
+ LWLockRelease(&((RT_ITER_CONTROL_SHARED *) iter->ctl)->lock);
+#endif
+
/* We've visited all nodes, so the iteration finished */
return NULL;
}
@@ -2255,9 +2368,45 @@ RT_ITERATE_NEXT(RT_ITER * iter, uint64 *key_p)
RT_SCOPE void
RT_END_ITERATE(RT_ITER * iter)
{
+#ifdef RT_SHMEM
+ RT_ITER_CONTROL_SHARED *ctl = (RT_ITER_CONTROL_SHARED *) iter->ctl;
+
+ if (iter->shared &&
+ pg_atomic_sub_fetch_u32(&ctl->refcnt, 1) == 0)
+ dsa_free(iter->tree->dsa, ctl->handle);
+#endif
pfree(iter);
}
+#ifdef RT_SHMEM
+RT_SCOPE RT_ITER_HANDLE
+RT_GET_ITER_HANDLE(RT_ITER * iter)
+{
+ Assert(iter->shared);
+ return ((RT_ITER_CONTROL_SHARED *) iter->ctl)->handle;
+}
+
+RT_SCOPE RT_ITER *
+RT_ATTACH_ITERATE_SHARED(RT_RADIX_TREE * tree, RT_ITER_HANDLE handle)
+{
+ RT_ITER *iter;
+ RT_ITER_CONTROL_SHARED *ctl;
+
+ iter = (RT_ITER *) MemoryContextAllocZero(tree->iter_context,
+ sizeof(RT_ITER));
+ iter->tree = tree;
+ ctl = (RT_ITER_CONTROL_SHARED *) dsa_get_address(tree->dsa, handle);
+ iter->ctl = (RT_ITER_CONTROL *) ctl;
+ iter->shared = true;
+
+ /* For every iterator, increase the refcnt by 1 */
+ pg_atomic_add_fetch_u32(&ctl->refcnt, 1);
+
+ return iter;
+}
+#endif
+
/***************** DELETION *****************/
#ifdef RT_USE_DELETE
@@ -2957,7 +3106,11 @@ RT_DUMP_NODE(RT_NODE * node)
#undef RT_PTR_ALLOC
#undef RT_INVALID_PTR_ALLOC
#undef RT_HANDLE
+#undef RT_ITER_HANDLE
+#undef RT_ITER_CONTROL
#undef RT_ITER
+#undef RT_SHARED_ITER
#undef RT_NODE
#undef RT_NODE_ITER
#undef RT_NODE_KIND_4
@@ -2994,6 +3147,11 @@ RT_DUMP_NODE(RT_NODE * node)
#undef RT_LOCK_SHARE
#undef RT_UNLOCK
#undef RT_GET_HANDLE
+#undef RT_BEGIN_ITERATE_SHARED
+#undef RT_ATTACH_ITERATE_SHARED
+#undef RT_GET_ITER_HANDLE
+#undef RT_ATTACH_ITER
#undef RT_FIND
#undef RT_SET
#undef RT_BEGIN_ITERATE
@@ -3050,5 +3208,6 @@ RT_DUMP_NODE(RT_NODE * node)
#undef RT_SHRINK_NODE_256
#undef RT_NODE_DELETE
#undef RT_NODE_INSERT
+#undef RT_INITIALIZE_ITER
#undef RT_NODE_ITERATE_NEXT
#undef RT_VERIFY_NODE
--
2.43.5
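To illustrate how the pieces added above fit together, here is a minimal
usage sketch (shown with the unexpanded RT_* template names; in a real
instantiation they expand according to RT_PREFIX, and passing the handle
through DSM is an assumption based on how the patch uses it):
```
RT_ITER    *iter;
RT_ITER_HANDLE handle;
uint64      key;
RT_VALUE_TYPE *value;

/* Leader: create the shared iterator and publish its handle (e.g. via DSM) */
iter = RT_BEGIN_ITERATE_SHARED(tree);
handle = RT_GET_ITER_HANDLE(iter);

/* Worker: attach to the same iteration state through the handle */
iter = RT_ATTACH_ITERATE_SHARED(tree, handle);

/*
 * Any participant: RT_ITERATE_NEXT() serializes access to the shared
 * control data with its LWLock, so each key/value pair is consumed by
 * exactly one process.
 */
while ((value = RT_ITERATE_NEXT(iter, &key)) != NULL)
{
    /* process (key, value) */
}

/*
 * Detach; the last participant to call this frees the DSA-allocated
 * control data once the refcnt drops to zero.
 */
RT_END_ITERATE(iter);
```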
Dear Sawada-san,
I've attached new version patches that fix failures reported by
cfbot. I hope these changes make cfbot happy.
Thanks for updating the patch and sorry for the delayed reply. I confirmed that cfbot
for Linux/Windows said OK.
I'm still learning the feature so I can post only one comment :-(.
I wanted to know whether TidStoreBeginIterateShared() was needed. IIUC, the pre-existing API,
TidStoreBeginIterate(), already accepts a shared TidStore. The only difference
is whether elog(ERROR) exists, but I wonder if it benefits others. Is there another
reason that lazy_vacuum_heap_rel() uses TidStoreBeginIterateShared()?
Another approach is to restrict TidStoreBeginIterate() to support only the local one.
What do you think?
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Mon, Nov 11, 2024 at 5:08 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Dear Sawda-san,
I've attached new version patches that fix failures reported by
cfbot. I hope these changes make cfbot happy.
Thanks for updating the patch and sorry for the delayed reply. I confirmed that cfbot
for Linux/Windows said ok.
I'm still learning the feature so I can post only one comment :-(.
I wanted to know whether TidStoreBeginIterateShared() was needed. IIUC, the pre-existing API,
TidStoreBeginIterate(), already accepts a shared TidStore. The only difference
is whether elog(ERROR) exists, but I wonder if it benefits others. Is there another
reason that lazy_vacuum_heap_rel() uses TidStoreBeginIterateShared()?
TidStoreBeginIterateShared() is designed for multiple parallel workers
to iterate a shared TidStore. During an iteration, parallel workers
share the iteration state and iterate the underlying radix tree while
taking appropriate locks. Therefore, it's available only for a shared
TidStore. This is required to implement the parallel heap vacuum,
where multiple parallel workers do the iteration on the shared
TidStore.
On the other hand, TidStoreBeginIterate() is designed for a single
process to iterate a TidStore. It accepts even a shared TidStore as
you mentioned, but during an iteration there is no inter-process
coordination such as locking. When it comes to parallel vacuum,
supporting TidStoreBeginIterate() on a shared TidStore is necessary to
cover the case where we use only parallel index vacuum but not
parallel heap scan/vacuum. In this case, we need to store dead tuple
TIDs on the shared TidStore during heap scan so parallel workers can
use it during index vacuum. But it's not necessary to use
TidStoreBeginIterateShared() because only one (leader) process does
heap vacuum.
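To make the distinction concrete, the intended call patterns would look
roughly like this (TidStoreBeginIterateShared() is from the patch; the
other calls are the existing tidstore API, and the per-block processing
is only a placeholder):
```
TidStoreIter *iter;
TidStoreIterResult *result;

/*
 * Single-process iteration: works on a local or shared TidStore, but
 * with no inter-process coordination during the iteration.
 */
iter = TidStoreBeginIterate(dead_items);
while ((result = TidStoreIterateNext(iter)) != NULL)
{
    /* vacuum the heap page result->blkno ... */
}
TidStoreEndIterate(iter);

/*
 * Shared iteration: multiple workers call this on the same shared
 * TidStore; the iteration state lives in DSA and access to it is
 * locked, so each block is handed out to exactly one worker.
 */
iter = TidStoreBeginIterateShared(dead_items);
while ((result = TidStoreIterateNext(iter)) != NULL)
{
    /* vacuum the heap page result->blkno ... */
}
TidStoreEndIterate(iter);
```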
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, 30 Oct 2024 at 22:48, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've attached new version patches that fix failures reported by
cfbot. I hope these changes make cfbot happy.
I just started reviewing the patch and found the following comments
while going through the patch:
1) I felt we should add some documentation for this at [1].
2) Can we add some tests in vacuum_parallel with tables having no
indexes and having dead tuples?
3) This should be included in typedefs.list:
3.a)
+/*
+ * Relation statistics collected during heap scanning and need to be
shared among
+ * parallel vacuum workers.
+ */
+typedef struct LVRelScanStats
+{
+ BlockNumber scanned_pages; /* # pages examined (not
skipped via VM) */
+ BlockNumber removed_pages; /* # pages removed by relation
truncation */
+ BlockNumber frozen_pages; /* # pages with newly frozen tuples */
3.b) Similarly this too:
+/*
+ * Struct for information that need to be shared among parallel vacuum workers
+ */
+typedef struct PHVShared
+{
+ bool aggressive;
+ bool skipwithvm;
+
3.c) Similarly this too:
+/* Per-worker scan state for parallel heap vacuum scan */
+typedef struct PHVScanWorkerState
+{
+ bool initialized;
3.d) Similarly this too:
+/* Struct for parallel heap vacuum */
+typedef struct PHVState
+{
+ /* Parallel scan description shared among parallel workers */
4) Since we are initializing almost all the members of the structure,
should we use palloc0 in this case:
+ scan_stats = palloc(sizeof(LVRelScanStats));
+ scan_stats->scanned_pages = 0;
+ scan_stats->removed_pages = 0;
+ scan_stats->frozen_pages = 0;
+ scan_stats->lpdead_item_pages = 0;
+ scan_stats->missed_dead_pages = 0;
+ scan_stats->nonempty_pages = 0;
+
+ /* Initialize remaining counters (be tidy) */
+ scan_stats->tuples_deleted = 0;
+ scan_stats->tuples_frozen = 0;
+ scan_stats->lpdead_items = 0;
+ scan_stats->live_tuples = 0;
+ scan_stats->recently_dead_tuples = 0;
+ scan_stats->missed_dead_tuples = 0;
5) Typo: "paralle" should be "parallel"
+/*
+ * Return the number of parallel workers for a parallel vacuum scan of this
+ * relation.
+ */
+static inline int
+table_paralle_vacuum_compute_workers(Relation rel, int requested)
+{
+ return rel->rd_tableam->parallel_vacuum_compute_workers(rel, requested);
+}
[1]: https://www.postgresql.org/docs/devel/sql-vacuum.html
Regards,
Vignesh
Dear Sawada-san,
TidStoreBeginIterateShared() is designed for multiple parallel workers
to iterate a shared TidStore. During an iteration, parallel workers
share the iteration state and iterate the underlying radix tree while
taking appropriate locks. Therefore, it's available only for a shared
TidStore. This is required to implement the parallel heap vacuum,
where multiple parallel workers do the iteration on the shared
TidStore.
On the other hand, TidStoreBeginIterate() is designed for a single
process to iterate a TidStore. It accepts even a shared TidStore as
you mentioned, but during an iteration there is no inter-process
coordination such as locking. When it comes to parallel vacuum,
supporting TidStoreBeginIterate() on a shared TidStore is necessary to
cover the case where we use only parallel index vacuum but not
parallel heap scan/vacuum. In this case, we need to store dead tuple
TIDs on the shared TidStore during heap scan so parallel workers can
use it during index vacuum. But it's not necessary to use
TidStoreBeginIterateShared() because only one (leader) process does
heap vacuum.
Okay, thanks for the description. I felt it is OK to keep.
I read 0001 again and here are my comments.
01. vacuumlazy.c
```
+#define LV_PARALLEL_SCAN_SHARED 0xFFFF0001
+#define LV_PARALLEL_SCAN_DESC 0xFFFF0002
+#define LV_PARALLEL_SCAN_DESC_WORKER 0xFFFF0003
```
I checked other DSM keys used for parallel work, and they seem to have names
like PARALLEL_KEY_XXX. Can we follow that?
02. LVRelState
```
+ BlockNumber next_fsm_block_to_vacuum;
```
Only this attribute does not have a comment. Can we add something like:
"Next freespace map page to be checked"?
03. parallel_heap_vacuum_gather_scan_stats
```
+ vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages;
```
Note that `scan_stats->vacuumed_pages` does not exist in 0001; it is defined
in 0004. Can you move it?
04. heap_parallel_vacuum_estimate
```
+
+ heap_parallel_estimate_shared_memory_size(rel, nworkers, &pscan_len,
+ &shared_len, &pscanwork_len);
+
+ /* space for PHVShared */
+ shm_toc_estimate_chunk(&pcxt->estimator, shared_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+ /* space for ParallelBlockTableScanDesc */
+ pscan_len = table_block_parallelscan_estimate(rel);
+ shm_toc_estimate_chunk(&pcxt->estimator, pscan_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+ /* space for per-worker scan state, PHVScanWorkerState */
+ pscanwork_len = mul_size(sizeof(PHVScanWorkerState), nworkers);
+ shm_toc_estimate_chunk(&pcxt->estimator, pscanwork_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
```
I feel pscan_len and pscanwork_len are already calculated in heap_parallel_estimate_shared_memory_size().
Can we remove table_block_parallelscan_estimate() and mul_size() from here?
05. Idea
Can you update the documentation?
06. Idea
AFAICS pg_stat_progress_vacuum does not contain information related to the
parallel execution. What do you think about adding an attribute that shows a list of PIDs?
Not sure it is helpful for users, but it can show the parallelism.
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Wed, Nov 13, 2024 at 3:10 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Dear Sawada-san,
TidStoreBeginIterateShared() is designed for multiple parallel workers
to iterate a shared TidStore. During an iteration, parallel workers
share the iteration state and iterate the underlying radix tree while
taking appropriate locks. Therefore, it's available only for a shared
TidStore. This is required to implement the parallel heap vacuum,
where multiple parallel workers do the iteration on the shared
TidStore.
On the other hand, TidStoreBeginIterate() is designed for a single
process to iterate a TidStore. It accepts even a shared TidStore as
you mentioned, but during an iteration there is no inter-process
coordination such as locking. When it comes to parallel vacuum,
supporting TidStoreBeginIterate() on a shared TidStore is necessary to
cover the case where we use only parallel index vacuum but not
parallel heap scan/vacuum. In this case, we need to store dead tuple
TIDs on the shared TidStore during heap scan so parallel workers can
use it during index vacuum. But it's not necessary to use
TidStoreBeginIterateShared() because only one (leader) process does
heap vacuum.
Okay, thanks for the description. I felt it is OK to keep.
I read 0001 again and here are comments.
Thank you for the review comments!
01. vacuumlazy.c
```
+#define LV_PARALLEL_SCAN_SHARED 0xFFFF0001
+#define LV_PARALLEL_SCAN_DESC 0xFFFF0002
+#define LV_PARALLEL_SCAN_DESC_WORKER 0xFFFF0003
```
I checked other DSM keys used for parallel work, and they seem to have names
like PARALLEL_KEY_XXX. Can we follow that?
Yes. How about LV_PARALLEL_KEY_XXX?
02. LVRelState
```
+ BlockNumber next_fsm_block_to_vacuum;
```
Only this attribute does not have a comment. Can we add something like:
"Next freespace map page to be checked"?
Agreed. I'll add a comment "next block to check for FSM vacuum".
03. parallel_heap_vacuum_gather_scan_stats
```
+ vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages;
```
Note that `scan_stats->vacuumed_pages` does not exist in 0001; it is defined
in 0004. Can you move it?
Will remove.
04. heap_parallel_vacuum_estimate
```
+
+ heap_parallel_estimate_shared_memory_size(rel, nworkers, &pscan_len,
+ &shared_len, &pscanwork_len);
+
+ /* space for PHVShared */
+ shm_toc_estimate_chunk(&pcxt->estimator, shared_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+ /* space for ParallelBlockTableScanDesc */
+ pscan_len = table_block_parallelscan_estimate(rel);
+ shm_toc_estimate_chunk(&pcxt->estimator, pscan_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+ /* space for per-worker scan state, PHVScanWorkerState */
+ pscanwork_len = mul_size(sizeof(PHVScanWorkerState), nworkers);
+ shm_toc_estimate_chunk(&pcxt->estimator, pscanwork_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
```
I feel pscan_len and pscanwork_len are already calculated in heap_parallel_estimate_shared_memory_size().
Can we remove table_block_parallelscan_estimate() and mul_size() from here?
Yes, it's an oversight. Will remove.
05. Idea
Can you update the documentation?
Will update the doc as well.
06. Idea
AFAICS pg_stat_progress_vacuum does not contain information related to the
parallel execution. What do you think about adding an attribute that shows a list of PIDs?
Not sure it is helpful for users, but it can show the parallelism.
I think it's possible to show the parallelism even today (for parallel
index vacuuming):
=# select leader.pid, leader.query, array_agg(worker.pid) from
pg_stat_activity as leader, pg_stat_activity as worker,
pg_stat_progress_vacuum as v where leader.pid = worker.leader_pid and
leader.pid = v.pid group by 1, 2;
pid | query | array_agg
---------+---------------------+-------------------
2952103 | vacuum (verbose) t; | {2952257,2952258}
(1 row)
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Nov 12, 2024 at 3:21 AM vignesh C <vignesh21@gmail.com> wrote:
On Wed, 30 Oct 2024 at 22:48, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've attached new version patches that fix failures reported by
cfbot. I hope these changes make cfbot happy.
I just started reviewing the patch and found the following comments
while going through the patch:
1) I felt we should add some documentation for this at [1].
2) Can we add some tests in vacuum_parallel with tables having no
indexes and having dead tuples?
3) This should be included in typedefs.list:
3.a)
+/*
+ * Relation statistics collected during heap scanning and need to be shared among
+ * parallel vacuum workers.
+ */
+typedef struct LVRelScanStats
+{
+ BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
+ BlockNumber removed_pages; /* # pages removed by relation truncation */
+ BlockNumber frozen_pages; /* # pages with newly frozen tuples */
3.b) Similarly this too:
+/*
+ * Struct for information that need to be shared among parallel vacuum workers
+ */
+typedef struct PHVShared
+{
+ bool aggressive;
+ bool skipwithvm;
+
3.c) Similarly this too:
+/* Per-worker scan state for parallel heap vacuum scan */
+typedef struct PHVScanWorkerState
+{
+ bool initialized;
3.d) Similarly this too:
+/* Struct for parallel heap vacuum */
+typedef struct PHVState
+{
+ /* Parallel scan description shared among parallel workers */
4) Since we are initializing almost all the members of the structure,
should we use palloc0 in this case:
+ scan_stats = palloc(sizeof(LVRelScanStats));
+ scan_stats->scanned_pages = 0;
+ scan_stats->removed_pages = 0;
+ scan_stats->frozen_pages = 0;
+ scan_stats->lpdead_item_pages = 0;
+ scan_stats->missed_dead_pages = 0;
+ scan_stats->nonempty_pages = 0;
+
+ /* Initialize remaining counters (be tidy) */
+ scan_stats->tuples_deleted = 0;
+ scan_stats->tuples_frozen = 0;
+ scan_stats->lpdead_items = 0;
+ scan_stats->live_tuples = 0;
+ scan_stats->recently_dead_tuples = 0;
+ scan_stats->missed_dead_tuples = 0;
5) Typo: "paralle" should be "parallel"
+/*
+ * Return the number of parallel workers for a parallel vacuum scan of this
+ * relation.
+ */
+static inline int
+table_paralle_vacuum_compute_workers(Relation rel, int requested)
+{
+ return rel->rd_tableam->parallel_vacuum_compute_workers(rel, requested);
+}
Thank you for the comments! I'll address these comments in the next
version patch.
BTW while updating the patch, I found that we might want to launch
different numbers of workers for scanning heap and vacuuming heap. The
number of parallel workers is determined based on the number of blocks
in the table. However, even if this number is high, it could happen
that we want to launch fewer workers to vacuum heap pages when not many
pages have garbage. And the number of workers for
vacuuming heap could vary on each vacuum pass. I'm considering
implementing it.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Dear Sawada-san,
BTW while updating the patch, I found that we might want to launch
different numbers of workers for scanning heap and vacuuming heap. The
number of parallel workers is determined based on the number of blocks
in the table. However, even if this number is high, it could happen
that we want to launch fewer workers to vacuum heap pages when not many
pages have garbage. And the number of workers for
vacuuming heap could vary on each vacuum pass. I'm considering
implementing it.
Just to clarify - this idea looks good to me. I imagine you will add new APIs for
tableam like parallel_vacuum_compute_workers_for_scanning and parallel_vacuum_compute_workers_for_vacuuming.
If other tableam developers want to use the same number of workers as scanning,
they can pass the same function to the pointer. Is that right?
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Hi Sawada-San,
FYI, the patch 0001 fails to build stand-alone
vacuumlazy.c: In function ‘parallel_heap_vacuum_gather_scan_stats’:
vacuumlazy.c:3739:21: error: ‘LVRelScanStats’ has no member named
‘vacuumed_pages’
vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages;
^
vacuumlazy.c:3739:43: error: ‘LVRelScanStats’ has no member named
‘vacuumed_pages’
vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages;
^
make[4]: *** [vacuumlazy.o] Error 1
It appears to be using a struct field which is not even introduced
until patch 0004 of the patch set.
======
Kind Regards,
Peter Smith.
Fujitsu Australia.
Hi Sawada-San,
I started to look at patch v4-0001 in this thread.
It is quite a big patch so this is a WIP, and these below are just the
comments I have so far.
======
src/backend/access/heap/vacuumlazy.c
1.1.
+/*
+ * Relation statistics collected during heap scanning and need to be
shared among
+ * parallel vacuum workers.
+ */
+typedef struct LVRelScanStats
The comment wording is not quite right.
/Relation statistics collected during heap scanning/Relation
statistics that are collected during heap scanning/
~~~
1.2
+/*
+ * Struct for information that need to be shared among parallel vacuum workers
+ */
+typedef struct PHVShared
The comment wording is not quite right.
/that need to be shared/that needs to be shared/
~~~
1.3.
+/* Struct for parallel heap vacuum */
+typedef struct PHVState
+{
+ /* Parallel scan description shared among parallel workers */
+ ParallelBlockTableScanDesc pscandesc;
+
+ /* Shared information */
+ PHVShared *shared;
If 'pscandesc' is described as 'shared among parallel workers', should
that field be within 'PHVShared' instead?
~~~
1.4.
/* Initialize page counters explicitly (be tidy) */
- vacrel->scanned_pages = 0;
- vacrel->removed_pages = 0;
- vacrel->frozen_pages = 0;
- vacrel->lpdead_item_pages = 0;
- vacrel->missed_dead_pages = 0;
- vacrel->nonempty_pages = 0;
- /* dead_items_alloc allocates vacrel->dead_items later on */
+ scan_stats = palloc(sizeof(LVRelScanStats));
+ scan_stats->scanned_pages = 0;
+ scan_stats->removed_pages = 0;
+ scan_stats->frozen_pages = 0;
+ scan_stats->lpdead_item_pages = 0;
+ scan_stats->missed_dead_pages = 0;
+ scan_stats->nonempty_pages = 0;
+
+ /* Initialize remaining counters (be tidy) */
+ scan_stats->tuples_deleted = 0;
+ scan_stats->tuples_frozen = 0;
+ scan_stats->lpdead_items = 0;
+ scan_stats->live_tuples = 0;
+ scan_stats->recently_dead_tuples = 0;
+ scan_stats->missed_dead_tuples = 0;
+
+ vacrel->scan_stats = scan_stats;
1.4a.
Or, maybe just palloc0 this and provide a comment to say all counters
have been zapped to 0.
~
1.4b.
Maybe you don't need that 'scan_stats' variable; just assign the
palloc0 directly to the field instead.
~~~
1.5.
- vacrel->missed_dead_tuples = 0;
+ /* dead_items_alloc allocates vacrel->dead_items later on */
The patch seems to have moved this "dead_items_alloc" comment to now
be below the "Allocate/initialize output statistics state" stuff (??).
======
src/backend/commands/vacuumparallel.c
parallel_vacuum_init:
1.6.
int parallel_workers = 0;
+ int nworkers_table;
+ int nworkers_index;
The local vars and function params are named like this (here and in
other functions). But the struct field names say 'nworkers_for_XXX'
e.g.
shared->nworkers_for_table = nworkers_table;
shared->nworkers_for_index = nworkers_index;
It may be better to use these 'nworkers_for_table' and
'nworkers_for_index' names consistently everywhere.
~~~
parallel_vacuum_compute_workers:
1.7.
- int parallel_workers;
+ int parallel_workers_table = 0;
+ int parallel_workers_index = 0;
+
+ *nworkers_table = 0;
+ *nworkers_index = 0;
The local variables 'parallel_workers_table' and
'parallel_workers_index; are hardly needed because AFAICT those
results can be directly assigned to *nworkers_table and
*nworkers_index.
~~~
parallel_vacuum_process_all_indexes:
1.8.
/* Reinitialize parallel context to relaunch parallel workers */
- if (num_index_scans > 0)
+ if (num_index_scans > 0 || pvs->num_table_scans > 0)
ReinitializeParallelDSM(pvs->pcxt);
I don't know if it is feasible or even makes sense to change, but
somehow it seemed strange that the 'num_index_scans' field is not
co-located with the 'num_table_scans' in the ParallelVacuumState. If
this is doable, then lots of functions also would no longer need to
pass 'num_index_scans' since they are already passing 'pvs'.
~~~
parallel_vacuum_table_scan_begin:
1.9.
+ (errmsg(ngettext("launched %d parallel vacuum worker for table
processing (planned: %d)",
+ "launched %d parallel vacuum workers for table processing (planned: %d)",
+ pvs->pcxt->nworkers_launched),
Isn't this the same as errmsg_plural?
~~~
1.10.
+/* Return the array of indexes associated to the given table to be vacuumed */
+Relation *
+parallel_vacuum_get_table_indexes(ParallelVacuumState *pvs, int *nindexes)
Even though the function comment can fit on one line it is nicer to
use a block-style comment with a period, like below. It then will be
consistent with other function comments (e.g.
parallel_vacuum_table_scan_end, parallel_vacuum_process_table, etc).
There are multiple places that this review comment can apply to.
(also typo /associated to/associated with/)
SUGGESTION
/*
* Return the array of indexes associated with the given table to be vacuumed.
*/
~~~
parallel_vacuum_get_nworkers_table:
parallel_vacuum_get_nworkers_index:
1.11.
+/* Return the number of workers for parallel table vacuum */
+int
+parallel_vacuum_get_nworkers_table(ParallelVacuumState *pvs)
+{
+ return pvs->shared->nworkers_for_table;
+}
+
+/* Return the number of workers for parallel index processing */
+int
+parallel_vacuum_get_nworkers_index(ParallelVacuumState *pvs)
+{
+ return pvs->shared->nworkers_for_index;
+}
+
Are these functions needed? AFAICT, they are called only from macros
where it seems just as easy to reference the pvs fields directly.
~~~
parallel_vacuum_process_table:
1.12.
+/*
+ * A parallel worker invokes table-AM specified vacuum scan callback.
+ */
+static void
+parallel_vacuum_process_table(ParallelVacuumState *pvs)
+{
+ Assert(VacuumActiveNWorkers);
Maybe here also we should Assert(pvs.shared->do_vacuum_table_scan);
~~~
1.13.
- /* Process indexes to perform vacuum/cleanup */
- parallel_vacuum_process_safe_indexes(&pvs);
+ if (pvs.shared->do_vacuum_table_scan)
+ {
+ parallel_vacuum_process_table(&pvs);
+ }
+ else
+ {
+ ErrorContextCallback errcallback;
+
+ /* Setup error traceback support for ereport() */
+ errcallback.callback = parallel_vacuum_error_callback;
+ errcallback.arg = &pvs;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* Process indexes to perform vacuum/cleanup */
+ parallel_vacuum_process_safe_indexes(&pvs);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+ }
There are still some functions following this code (like
'shm_toc_lookup') that could potentially raise ERRORs. But, now the
error_context_stack is getting assigned/reset earlier than was
previously the case. Is that going to be a potential problem?
======
src/include/access/tableam.h
1.14.
+ /*
+ * Compute the amount of DSM space AM need in the parallel table vacuum.
+ *
Maybe reword this comment to be more like table_parallelscan_estimate.
SUGGESTION
Estimate the size of shared memory that the parallel table vacuum needs for AM.
~~~
1.15.
+/*
+ * Estimate the size of shared memory needed for a parallel vacuum scan of this
+ * of this relation.
+ */
+static inline void
+table_parallel_vacuum_estimate(Relation rel, ParallelContext *pcxt,
int nworkers,
+ void *state)
+{
+ rel->rd_tableam->parallel_vacuum_estimate(rel, pcxt, nworkers, state);
+}
+
+/*
+ * Initialize shared memory area for a parallel vacuum scan of this relation.
+ */
+static inline void
+table_parallel_vacuum_initialize(Relation rel, ParallelContext *pcxt,
int nworkers,
+ void *state)
+{
+ rel->rd_tableam->parallel_vacuum_initialize(rel, pcxt, nworkers, state);
+}
+
+/*
+ * Start a parallel vacuum scan of this relation.
+ */
+static inline void
+table_parallel_vacuum_scan(Relation rel, ParallelVacuumState *pvs,
+ ParallelWorkerContext *pwcxt)
+{
+ rel->rd_tableam->parallel_vacuum_scan_worker(rel, pvs, pwcxt);
+}
+
All of the "Callbacks for parallel table vacuum." had comments saying
"Not called if parallel table vacuum is disabled.". So, IIUC that
means all of these table_parallel_vacuum_XXX functions (other than the
compute_workers one) could have Assert(nworkers > 0); just to
double-check that is true.
~~~
table_paralle_vacuum_compute_workers:
1.16.
+static inline int
+table_paralle_vacuum_compute_workers(Relation rel, int requested)
+{
+ return rel->rd_tableam->parallel_vacuum_compute_workers(rel, requested);
+}
Typo in function name. /paralle/parallel/
======
Kind Regards,
Peter Smith.
Fujitsu Australia
Hi Sawada-San.
FWIW, here is the remainder of my review comments for the patch v4-0001
======
src/backend/access/heap/vacuumlazy.c
lazy_scan_heap:
2.1.
+ /*
+ * Do the actual work. If parallel heap vacuum is active, we scan and
+ * vacuum heap with parallel workers.
+ */
/with/using/
~~~
2.2.
+ if (ParallelHeapVacuumIsActive(vacrel))
+ do_parallel_lazy_scan_heap(vacrel);
+ else
+ do_lazy_scan_heap(vacrel);
The do_lazy_scan_heap() returns a boolean and according to that
function comment it should always be true if it is not using the
parallel heap scan. So should we get the function return value here
and Assert that it is true?
~~~
2.3.
Start with uppercase for all the single-line comments, for consistency
with existing code.
e.g.
+ /* report that everything is now scanned */
e.g
+ /* now we can compute the new value for pg_class.reltuples */
e.g.
+ /* report all blocks vacuumed */
~~~
heap_vac_scan_next_block_parallel:
2.4.
+/*
+ * A parallel scan variant of heap_vac_scan_next_block.
+ *
+ * In parallel vacuum scan, we don't use the SKIP_PAGES_THRESHOLD optimization.
+ */
+static bool
+heap_vac_scan_next_block_parallel(LVRelState *vacrel, BlockNumber *blkno,
+ bool *all_visible_according_to_vm)
The function comment should explain the return value.
~~~
2.5.
+ if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
+ {
+
+ if (vacrel->aggressive)
+ break;
Unnecessary whitespace.
~~~
dead_items_alloc:
2.6.
+ /*
+ * We initialize parallel heap scan/vacuuming or index vacuuming
+ * or both based on the table size and the number of indexes. Note
+ * that only one worker can be used for an index, we invoke
+ * parallelism for index vacuuming only if there are at least two
+ * indexes on a table.
+ */
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
vac_work_mem,
vacrel->verbose ? INFO : DEBUG2,
- vacrel->bstrategy);
+ vacrel->bstrategy, (void *) vacrel);
Is this information misplaced? Why describe here "only one worker" and
"at least two indexes on a table" I don't see anything here checking
those conditions.
~~~
heap_parallel_vacuum_compute_workers:
2.7.
+ /*
+ * Select the number of workers based on the log of the size of the
+ * relation. This probably needs to be a good deal more
+ * sophisticated, but we need something here for now. Note that the
+ * upper limit of the min_parallel_table_scan_size GUC is chosen to
+ * prevent overflow here.
+ */
The "This probably needs to..." part maybe should have an "XXX" marker
in the comment which AFAIK is used to highlight current decisions and
potential for future changes.
~~~
heap_parallel_vacuum_initialize:
2.8.
There is inconsistent capitalization of the single-line comments in
this function. The same occurs in many functions in this file. but it
is just a bit more obvious in this one. Please see all the others too.
~~~
parallel_heap_complete_unfinised_scan:
2.9.
+static void
+parallel_heap_complete_unfinised_scan(LVRelState *vacrel)
TYPO in function name /unfinised/unfinished/
~~~
2.10.
+ if (!wstate->maybe_have_blocks)
+
+ continue;
Unnecessary blank line.
~~~
2.11.
+
+ /* Attache the worker's scan state and do heap scan */
+ vacrel->phvstate->myscanstate = wstate;
+ scan_done = do_lazy_scan_heap(vacrel);
/Attache/Attach/
~~~
2.12.
+ /*
+ * We don't need to gather the scan statistics here because statistics
+ * have already been accumulated the leaders statistics directly.
+ */
"have already been accumulated the leaders" -- missing word there somewhere?
~~~
do_parallel_lazy_scan_heap:
2.13.
+ /*
+ * If the heap scan paused in the middle of the table due to full of
+ * dead_items TIDs, perform a round of index and heap vacuuming.
+ */
+ if (!scan_done)
+ {
+ /* Perform a round of index and heap vacuuming */
+ vacrel->consider_bypass_optimization = false;
+ lazy_vacuum(vacrel);
+
+ /*
+ * Vacuum the Free Space Map to make newly-freed space visible on
+ * upper-level FSM pages.
+ */
+ if (vacrel->phvstate->min_blkno > vacrel->next_fsm_block_to_vacuum)
+ {
+ /*
+ * min_blkno should have already been updated when gathering
+ * statistics
+ */
+ FreeSpaceMapVacuumRange(vacrel->rel, vacrel->next_fsm_block_to_vacuum,
+ vacrel->phvstate->min_blkno + 1);
+ vacrel->next_fsm_block_to_vacuum = vacrel->phvstate->min_blkno;
+ }
+
+ /* Report that we are once again scanning the heap */
+ pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+ PROGRESS_VACUUM_PHASE_SCAN_HEAP);
+
+ /* re-launcher workers */
+ vacrel->phvstate->nworkers_launched =
+ parallel_vacuum_table_scan_begin(vacrel->pvs);
+
+ continue;
+ }
+
+ /* We reach the end of the table */
+ break;
Instead of:
if (!scan_done)
{
<other code ...>
continue;
}
break;
Won't it be better to refactor like:
SUGGESTION
if (scan_done)
break;
<other code...>
~~~
2.14.
+ /*
+ * The parallel heap vacuum finished, but it's possible that some workers
+ * have allocated blocks but not processed yet. This can happen for
+ * example when workers exit because of full of dead_items TIDs and the
+ * leader process could launch fewer workers in the next cycle.
+ */
There seem to be some missing words:
e.g. /not processed yet./not processed them yet./
e.g. /because of full of dead_items/because they are full of dead_items/
======
Kind Regards,
Peter Smith.
Fujitsu Australia
Hi,
Thanks for working on this. I took a quick look at this today, to do
some basic review. I plan to do a bunch of testing, but that's mostly to
get a better idea of what kind of improvements to expect - the initial
results look quite nice and sensible.
A couple basic comments:
1) I really like the idea of introducing "compute_workers" callback to
the heap AM interface. I faced a similar issue with calculating workers
for index builds, because right now plan_create_index_workers is doing
that; the logic works for btree, but really not for brin etc. It didn't occur
to me we might make this part of the index AM ...
2) I find it a bit weird vacuumlazy.c needs to include optimizer/paths.h
because it really has nothing to do with planning / paths. I realize it
needs the min_parallel_table_scan_size, but it doesn't seem right. I
guess it's a sign this bit of code (calculating parallel workers based
on log of relation size) should be in some "shared" location.
3) The difference in naming ParallelVacuumState vs. PHVState is a bit
weird. I suggest ParallelIndexVacuumState and ParallelHeapVacuumState to
make it consistent and clear.
4) I think it would be good to have some sort of README explaining how
the parallel heap vacuum works, i.e. how it's driven by FSM. Took me a
while to realize how the workers coordinate which blocks to scan.
5) Wouldn't it be better to introduce the scan_stats (grouping some of
the fields in a separate patch)? Seems entirely independent from the
parallel part, so doing it separately would make it easier to review.
Also, maybe reference the fields through scan_stats->x, instead of
through vacrel->scan_stats->x, when there's the pointer.
6) Is it a good idea to move NewRelfrozenXID/... to the scan_stats?
AFAIK it's not a statistic, it's actually a parameter affecting the
decisions, right?
7) I find it a bit strange that heap_vac_scan_next_block() needs to
check if it's a parallel scan, and redirect to the parallel callback. I
mean, shouldn't the caller know which callback to invoke? Why should the
serial callback care about this?
8) It's not clear to me why do_lazy_scan_heap() needs to "advertise" the
current block. Can you explain?
9) I'm a bit confused why the code checks IsParallelWorker() in so many
places. Doesn't that mean the leader can't participate in the vacuum
like a regular worker?
10) I'm not quite sure I understand the comments at the end of
do_lazy_scan_heap - it says "do a cycle of vacuuming" but I guess that
means "index vacuuming", right? And then it says "pause without invoking
index and heap vacuuming" but isn't the whole point of this block to do
that cleanup so that the TidStore can be discarded? Maybe I just don't
understand how the work is divided between the leader and workers ...
11) Why does GlobalVisState need to move to snapmgr.h? If I undo this
the patch still builds fine for me.
thanks
--
Tomas Vondra
Dear Tomas,
1) I really like the idea of introducing "compute_workers" callback to
the heap AM interface. I faced a similar issue with calculating workers
for index builds, because right now plan_create_index_workers is doing
that; the logic works for btree, but really not for brin etc. It didn't occur
to me we might make this part of the index AM ...
+1, so let's keep the proposed style. Or, can we even propose the idea
to table/index access method API?
I've considered it a bit, and the point seems to be which arguments should be required.
4) I think it would be good to have some sort of README explaining how
the parallel heap vacuum works, i.e. how it's driven by FSM. Took me a
while to realize how the workers coordinate which blocks to scan.
I love the idea; it would be quite helpful for reviewers like me.
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Mon, Dec 9, 2024 at 2:11 PM Tomas Vondra <tomas@vondra.me> wrote:
Hi,
Thanks for working on this. I took a quick look at this today, to do
some basic review. I plan to do a bunch of testing, but that's mostly to
get a better idea of what kind of improvements to expect - the initial
results look quite nice and sensible.
Thank you for reviewing the patch!
A couple basic comments:
1) I really like the idea of introducing "compute_workers" callback to
the heap AM interface. I faced a similar issue with calculating workers
for index builds, because right now plan_create_index_workers is doing
that; the logic works for btree, but really not for brin etc. It didn't occur
to me we might make this part of the index AM ...
Thanks.
2) I find it a bit weird vacuumlazy.c needs to include optimizer/paths.h
because it really has nothing to do with planning / paths. I realize it
needs the min_parallel_table_scan_size, but it doesn't seem right. I
guess it's a sign this bit of code (calculating parallel workers based
on log of relation size) should be in some "shared" location.
True. The same is actually true also for vacuumparallel.c. It includes
optimizer/paths.h to use min_parallel_index_scan_size.
3) The difference in naming ParallelVacuumState vs. PHVState is a bit
weird. I suggest ParallelIndexVacuumState and ParallelHeapVacuumState to
make it consistent and clear.
With the patch, since ParallelVacuumState is no longer dedicated to
parallel index vacuuming, we cannot rename them in this way. Both
parallel table scanning/vacuuming and parallel index vacuuming can use
the same ParallelVacuumState instance. The heap-specific necessary
data for parallel heap scanning and vacuuming are stored in PHVState.
4) I think it would be good to have some sort of README explaining how
the parallel heap vacuum works, i.e. how it's driven by FSM. Took me a
while to realize how the workers coordinate which blocks to scan.
+1. I will add README in the next version patch.
5) Wouldn't it be better to introduce the scan_stats (grouping some of
the fields in a separate patch)? Seems entirely independent from the
parallel part, so doing it separately would make it easier to review.
Also, maybe reference the fields through scan_stats->x, instead of
through vacrel->scan_stats->x, when there's the pointer.
Agreed.
6) Is it a good idea to move NewRelfrozenXID/... to the scan_stats?
AFAIK it's not a statistic, it's actually a parameter affecting the
decisions, right?
Right. It would be better to move them to a separate struct or somewhere.
7) I find it a bit strange that heap_vac_scan_next_block() needs to
check if it's a parallel scan, and redirect to the parallel callback. I
mean, shouldn't the caller know which callback to invoke? Why should the
serial callback care about this?
do_lazy_scan_heap(), the sole caller of heap_vac_scan_next_block(), is
called in serial vacuum and parallel vacuum cases. I wanted to make
heap_vac_scan_next_block() workable in both cases. I think it also
makes sense to have do_lazy_scan_heap() call either function depending
on whether parallel scan is enabled.
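So the dispatch would look roughly like this (paraphrased from the
patch, with the serial branch elided):
```
static bool
heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
                         bool *all_visible_according_to_vm)
{
    /* In a parallel scan, blocks are allocated via the parallel variant */
    if (ParallelHeapVacuumIsActive(vacrel))
        return heap_vac_scan_next_block_parallel(vacrel, blkno,
                                                 all_visible_according_to_vm);

    /* ... existing serial logic, including SKIP_PAGES_THRESHOLD ... */
}
```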
8) It's not clear to me why do_lazy_scan_heap() needs to "advertise" the
current block. Can you explain?
The workers' current block numbers are used to calculate the minimum
block number we've scanned up to so far. In the serial scan case, we
vacuum the FSM for a particular block range every
VACUUM_FSM_EVERY_PAGES pages. In the parallel scan case, on the other
hand, it doesn't make sense to vacuum the FSM that way because we might
not have processed some blocks in the block range. So the idea is to
calculate the minimum block number we've scanned up to so far and
vacuum the FSM for the range of consecutive already-scanned blocks.
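A sketch of that calculation (the per-worker field names are
hypothetical; the point is that the FSM can be vacuumed only up to the
smallest block number any worker has reached):
```
static BlockNumber
min_scanned_block(PHVState *phvstate, int nworkers)
{
    BlockNumber min_blkno = InvalidBlockNumber;

    for (int i = 0; i < nworkers; i++)
    {
        /* current_blkno is the block each worker advertises */
        BlockNumber blkno = phvstate->scanstates[i].current_blkno;

        if (min_blkno == InvalidBlockNumber || blkno < min_blkno)
            min_blkno = blkno;
    }

    return min_blkno;
}
```
The leader can then call FreeSpaceMapVacuumRange() over the range from
next_fsm_block_to_vacuum up to that minimum, as the patch does in
do_parallel_lazy_scan_heap().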
9) I'm a bit confused why the code checks IsParallelWorker() in so many
places. Doesn't that mean the leader can't participate in the vacuum
like a regular worker?
I used '!IsParallelWorker()' for some jobs that should be done only by
the leader process, for example checking failsafe mode, vacuuming the
FSM, etc.
10) I'm not quite sure I understand the comments at the end of
do_lazy_scan_heap - it says "do a cycle of vacuuming" but I guess that
means "index vacuuming", right?
It means both index vacuuming and heap vacuuming.
And then it says "pause without invoking
index and heap vacuuming" but isn't the whole point of this block to do
that cleanup so that the TidStore can be discarded? Maybe I just don't
understand how the work is divided between the leader and workers ...
The comment needs to be updated. But what the patch does is that when
the memory usage of the shared TidStore reaches the limit, worker
processes exit after updating the shared statistics, and then the
leader invokes (parallel) index vacuuming and parallel heap vacuuming.
Since a different number of workers could be used for parallel heap
scan, parallel index vacuuming, and parallel heap vacuuming, the
leader process waits for all workers to finish at the end of each phase.
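Paraphrased as pseudocode, the leader's control flow is roughly the
following (function names are from the patch; the loop structure is a
sketch of the description above, not the literal implementation):
```
for (;;)
{
    /* (re)launch workers for the scan phase */
    parallel_vacuum_table_scan_begin(pvs);

    /* the leader participates too; returns false if dead_items fills up */
    scan_done = do_lazy_scan_heap(vacrel);

    /* wait for all scan workers to exit */
    parallel_vacuum_table_scan_end(pvs);

    if (scan_done)
        break;

    /*
     * dead_items filled up: one round of (parallel) index vacuuming and
     * heap vacuuming, each of which may launch its own set of workers.
     */
    lazy_vacuum(vacrel);
}
```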
11) Why does GlobalVisState need to move to snapmgr.h? If I undo this
the patch still builds fine for me.
Oh, I might have missed something. I'll check if it's really necessary.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On 12/9/24 19:47, Tomas Vondra wrote:
Hi,
Thanks for working on this. I took a quick look at this today, to do
some basic review. I plan to do a bunch of testing, but that's mostly to
get a better idea of what kind of improvements to expect - the initial
results look quite nice and sensible.
I worked on the benchmarks/testing, mostly to get an idea of how
effective this vacuum parallelism is. But I noticed something a bit
weird ...
Attached is a bash script I used for the testing - it measures vacuum
with varying numbers of indexes, number of deleted tuples, WAL logging,
etc. And it does that both with master and patched builds, with
different numbers of vacuum workers.
It does expect databases "test-small-logged" and "test-small-unlogged",
initialized like this:
create [unlogged] table test_vacuum (a bigint)
with (autovacuum_enabled=off);
insert into test_vacuum select i from generate_series(1,100000000) s(i);
create index idx_0 on test_vacuum (a);
create index idx_1 on test_vacuum (a);
create index idx_2 on test_vacuum (a);
create index idx_3 on test_vacuum (a);
create index idx_4 on test_vacuum (a);
That's a ~2GB table, with a bunch of indexes. Not massive, not tiny.
I wanted to run this on larger datasets too, but for now I have the
small dataset.
One of the things the tests change is the fraction of pages with deleted
rows. The DELETE has
... WHERE mod(id,M) = 0
where "id" is a bigint column with sequential values. There are ~230
rows per page, so the M determines what fraction of pages gets a DELETE.
With M=100, each page gets ~2 deleted rows, with M=500 we get a page
with a delete, then a clean page, etc. Similar for 1000 and 5000.
Attached are results.csv with raw data, and a PDF showing the difference
between master and patched build with varying number of workers. The
columns on the right show timing relative to master (with no parallel
workers). Green means "faster" and "red" would be "slower" (but there
are no such cases). 50% means "half the time" i.e. "twice as fast".
And for M=100 and M=500 the results look quite sensible. But for higher
values of M (i.e. smaller fraction of the table DELETED) things get a
bit strange, especially for the runs with 0 indexes.
Consider for example these runs from i5 machine with M=5000:
           master  patched
 indexes      0       0      1      2      3      4      6      8
 -----------------------------------------------------------------
       0    2.58    2.75   0.17   0.19   0.16   0.24   0.20   0.19
On master it takes 2.58s, and on patched build (0 workers) it's ~2.75s,
so about the same (single run, so the difference is just noise).
But then with 1 worker it drops to 0.17s. That's ~15x faster, but we
only added one worker, so the best we could expect is 2x. Either there's
a bug that skips some work, or the master code is horribly inefficient.
The reason for the difference is this - on master, the vacuum verbose
log looks like this:
---
INFO: vacuuming "test.public.test_vacuum"
INFO: finished vacuuming "test.public.test_vacuum": index scans: 0
pages: 0 removed, 221239 remain, 221239 scanned (100.00% of total)
tuples: 10000 removed, 49590000 remain, 0 are dead but not yet removable
removable cutoff: 20088, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead
item identifiers removed
avg read rate: 642.429 MB/s, avg write rate: 30.650 MB/s
buffer usage: 231616 hits, 210965 reads, 10065 dirtied
WAL usage: 30058 records, 10065 full page images, 72101687 bytes
system usage: CPU: user: 2.29 s, system: 0.27 s, elapsed: 2.56 s
---
and on patched with no parallelism it's almost the same:
---
INFO: vacuuming "test.public.test_vacuum"
INFO: finished vacuuming "test.public.test_vacuum": index scans: 0
pages: 0 removed, 221239 remain, 221239 scanned (100.00% of total)
tuples: 10000 removed, 49570000 remain, 0 are dead but not yet removable
removable cutoff: 20094, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead
item identifiers removed
avg read rate: 602.557 MB/s, avg write rate: 28.748 MB/s
buffer usage: 231620 hits, 210961 reads, 10065 dirtied
WAL usage: 30058 records, 10065 full page images, 71578455 bytes
system usage: CPU: user: 2.42 s, system: 0.30 s, elapsed: 2.73 s
---
But then for vacuum (parallel 1) it changes like this:
---
INFO: vacuuming "test.public.test_vacuum"
INFO: launched 1 parallel vacuum worker for table processing (planned: 1)
INFO: finished vacuuming "test.public.test_vacuum": index scans: 0
pages: 0 removed, 221239 remain, 10001 scanned (4.52% of total)
tuples: 10000 removed, 49137961 remain, 0 are dead but not yet removable
removable cutoff: 20107, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead
item identifiers removed
avg read rate: 0.000 MB/s, avg write rate: 525.533 MB/s
buffer usage: 25175 hits, 0 reads, 10065 dirtied
WAL usage: 30058 records, 10065 full page images, 70547639 bytes
system usage: CPU: user: 0.07 s, system: 0.02 s, elapsed: 0.14 s
---
The main difference is here:
master / no parallel workers:
pages: 0 removed, 221239 remain, 221239 scanned (100.00% of total)
1 parallel worker:
pages: 0 removed, 221239 remain, 10001 scanned (4.52% of total)
Clearly, with parallel vacuum we scan only a tiny fraction of the pages,
essentially just those with deleted tuples, which is ~1/20 of pages.
That's close to the 15x speedup.
This effect is clearest without indexes, but it does affect even runs
with indexes - having to scan the indexes makes it much less pronounced,
though. However, these indexes are pretty massive (about the same size
as the table) - multiple times larger than the table. Chances are it'd
be clearer on realistic data sets.
So the question is - is this correct? And if yes, why doesn't the
regular (serial) vacuum do that?
There are some more strange things, though. For example, how come the avg
read rate is 0.000 MB/s?
avg read rate: 0.000 MB/s, avg write rate: 525.533 MB/s
It scanned 10k pages, i.e. ~80MB of data in 0.15 seconds. Surely that's
not 0.000 MB/s? I guess it's calculated from buffer misses, and all the
pages are in shared buffers (thanks to the DELETE earlier in that session).
regards
--
Tomas Vondra
On 12/13/24 00:04, Tomas Vondra wrote:
...
Attached are results.csv with raw data, and a PDF showing the difference
between master and patched build with varying number of workers. The
columns on the right show timing relative to master (with no parallel
workers). Green means "faster" and "red" would be "slower" (but there
are no such cases). 50% means "half the time" i.e. "twice as fast"....
Apologies, forgot the PDF with results, so here it is.
regards
--
Tomas Vondra
Attachments:
On 12/13/24 00:04, Tomas Vondra wrote:
...
The main difference is here:
master / no parallel workers:
pages: 0 removed, 221239 remain, 221239 scanned (100.00% of total)
1 parallel worker:
pages: 0 removed, 221239 remain, 10001 scanned (4.52% of total)
Clearly, with parallel vacuum we scan only a tiny fraction of the pages,
essentially just those with deleted tuples, which is ~1/20 of pages.
That's close to the 15x speedup.
This effect is clearest without indexes, but it does affect even runs
with indexes - having to scan the indexes makes it much less pronounced,
though. However, these indexes are pretty massive (about the same size
as the table) - multiple times larger than the table. Chances are it'd
be clearer on realistic data sets.
So the question is - is this correct? And if yes, why doesn't the
regular (serial) vacuum do that?
There are some more strange things, though. For example, how come the avg
read rate is 0.000 MB/s?
avg read rate: 0.000 MB/s, avg write rate: 525.533 MB/s
It scanned 10k pages, i.e. ~80MB of data in 0.15 seconds. Surely that's
not 0.000 MB/s? I guess it's calculated from buffer misses, and all the
pages are in shared buffers (thanks to the DELETE earlier in that session).
OK, after looking into this a bit more I think the reason is rather
simple - SKIP_PAGES_THRESHOLD.
With serial runs, we end up scanning all pages, because even with an
update every 5000 tuples, that's still only ~25 pages apart, well within
the 32-page window. So we end up skipping no pages, and scan and vacuum
everything.
But parallel runs have this skipping logic disabled, or rather the logic
that switches to sequential scans if the gap is less than 32 pages.
IMHO this raises two questions:
1) Shouldn't parallel runs use SKIP_PAGES_THRESHOLD too, i.e. switch to
sequential scans is the pages are close enough. Maybe there is a reason
for this difference? Workers can reduce the difference between random
and sequential I/O, similarly to prefetching. But that just means the
workers should use a lower threshold, e.g. as
SKIP_PAGES_THRESHOLD / nworkers
or something like that? I don't see this discussed in this thread.
2) It seems the current SKIP_PAGES_THRESHOLD is awfully high for good
storage. If I can get an order of magnitude improvement (or more than
that) by disabling the threshold, and just doing random I/O, maybe
there's time to adjust it a bit.
regards
--
Tomas Vondra
On Sat, Dec 14, 2024 at 1:24 PM Tomas Vondra <tomas@vondra.me> wrote:
On 12/13/24 00:04, Tomas Vondra wrote:
...
The main difference is here:
master / no parallel workers:
pages: 0 removed, 221239 remain, 221239 scanned (100.00% of total)
1 parallel worker:
pages: 0 removed, 221239 remain, 10001 scanned (4.52% of total)
Clearly, with parallel vacuum we scan only a tiny fraction of the pages,
essentially just those with deleted tuples, which is ~1/20 of pages.
That's close to the 15x speedup.
This effect is clearest without indexes, but it does affect even runs
with indexes - having to scan the indexes makes it much less pronounced,
though. However, these indexes are pretty massive (about the same size
as the table) - multiple times larger than the table. Chances are it'd
be clearer on realistic data sets.
So the question is - is this correct? And if yes, why doesn't the
regular (serial) vacuum do that?
There are some more strange things, though. For example, how come the avg
read rate is 0.000 MB/s?
avg read rate: 0.000 MB/s, avg write rate: 525.533 MB/s
It scanned 10k pages, i.e. ~80MB of data in 0.15 seconds. Surely that's
not 0.000 MB/s? I guess it's calculated from buffer misses, and all the
pages are in shared buffers (thanks to the DELETE earlier in that session).
OK, after looking into this a bit more I think the reason is rather
simple - SKIP_PAGES_THRESHOLD.
With serial runs, we end up scanning all pages, because even with an
update every 5000 tuples, that's still only ~25 pages apart, well within
the 32-page window. So we end up skipping no pages, and scan and vacuum
everything.
But parallel runs have this skipping logic disabled, or rather the logic
that switches to sequential scans if the gap is less than 32 pages.
IMHO this raises two questions:
1) Shouldn't parallel runs use SKIP_PAGES_THRESHOLD too, i.e. switch to
sequential scans if the pages are close enough? Maybe there is a reason
for this difference? Workers can reduce the difference between random
and sequential I/O, similarly to prefetching. But that just means the
workers should use a lower threshold, e.g. as
SKIP_PAGES_THRESHOLD / nworkers
or something like that? I don't see this discussed in this thread.
Each parallel heap scan worker allocates a chunk of blocks which is
8192 blocks at maximum, so we would need to use the
SKIP_PAGES_THRESHOLD optimization within the chunk. I agree that we
need to evaluate the differences anyway. Will do the benchmark test
and share the results.
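In other words, the skipping decision would be confined to the chunk of
blocks a worker was allocated, roughly like this (all names except
SKIP_PAGES_THRESHOLD are illustrative):
```
/*
 * Within the worker's chunk [chunk_start, chunk_end], apply the same
 * heuristic as the serial scan: if the next block that must be scanned
 * is close, read the intervening pages sequentially instead of skipping.
 */
if (next_unskippable_block <= chunk_end &&
    next_unskippable_block - blkno < SKIP_PAGES_THRESHOLD)
{
    blkno++;                            /* small gap: read sequentially */
}
else
{
    blkno = next_unskippable_block;     /* large gap: actually skip */
}
```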
2) It seems the current SKIP_PAGES_THRESHOLD is awfully high for good
storage. If I can get an order of magnitude improvement (or more than
that) by disabling the threshold, and just doing random I/O, maybe
there's time to adjust it a bit.
Yeah, you've started a thread for this so let's discuss it there.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Dec 11, 2024 at 12:07 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Dec 9, 2024 at 2:11 PM Tomas Vondra <tomas@vondra.me> wrote:
Hi,
Thanks for working on this. I took a quick look at this today, to do
some basic review. I plan to do a bunch of testing, but that's mostly to
get a better idea of what kind of improvements to expect - the initial
results look quite nice and sensible.

Thank you for reviewing the patch!
I've attached the updated patches. Here are my responses to some of the review comments:
2) I find it a bit weird vacuumlazy.c needs to include optimizer/paths.h
because it really has nothing to do with planning / paths. I realize it
needs the min_parallel_table_scan_size, but it doesn't seem right. I
guess it's a sign this bit of code (calculating parallel workers based
on log of relation size) should live in some "shared" location.

True. The same is actually true for vacuumparallel.c as well. It includes
optimizer/paths.h to use min_parallel_index_scan_size.
I left this change for now. Since vacuumparallel.c already has the same
issue, I think we can address it in a separate patch.
4) I think it would be good to have some sort of README explaining how
the parallel heap vacuum works, i.e. how it's driven by FSM. Took me a
while to realize how the workers coordinate which blocks to scan.+1. I will add README in the next version patch.
I've added the comment at the top of vacuumlazy.c to explain the
overall of how parallel vacuum works (done in 0008 patch).
5) Wouldn't it be better to introduce the scan_stats (grouping some of
the fields) in a separate patch? Seems entirely independent from the
parallel part, so doing it separately would make it easier to review.
Also, maybe reference the fields through scan_stats->x, instead of
through vacrel->scan_stats->x, when there's the pointer.

Agreed.
Done in 0001 patch.
6) Is it a good idea to move NewRelfrozenXID/... to the scan_stats?
AFAIK it's not a statistic, it's actually a parameter affecting the
decisions, right?

Right. It would be better to move them to a separate struct or similar.
I've renamed it to LVRelScanState.
8) It's not clear to me why do_lazy_scan_heap() needs to "advertise" the
current block. Can you explain?

The workers' current block numbers are used to calculate the minimum
block number where we've scanned so far. In the serial scan case, we
vacuum the FSM of a particular block range every
VACUUM_FSM_EVERY_PAGES pages. On the other hand, in the parallel scan
case, it doesn't make sense to vacuum the FSM that way because we might
not have processed some blocks in the block range. So the idea is to
calculate the minimum block number where we've scanned so far and
vacuum FSM of the range of consecutive already-scanned blocks.
I've simplified the logic to calculate the minimum scanned block. We
didn't actually need to advertise the current block.
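For reference, the calculation is essentially a minimum over the
per-worker progress, along these lines (a rough sketch of what
parallel_heap_vacuum_compute_min_scanned_blkno() does; the array layout
here is illustrative):

static BlockNumber
compute_min_scanned_blkno(PHVScanWorkerState *states, int nworkers)
{
    BlockNumber min_blkno = InvalidBlockNumber;

    for (int i = 0; i < nworkers; i++)
    {
        /* ignore workers that haven't scanned anything yet */
        if (!BlockNumberIsValid(states[i].last_blkno))
            continue;

        if (!BlockNumberIsValid(min_blkno) ||
            states[i].last_blkno < min_blkno)
            min_blkno = states[i].last_blkno;
    }

    /* all blocks up to min_blkno are known to be scanned already */
    return min_blkno;
}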
11) Why does GlobalVisState need to move to snapmgr.h? If I undo this
the patch still builds fine for me.

Oh, I might have missed something. I'll check if it's really necessary.
I've tried to undo that change, but now that we copy the contents of
GlobalVisState in vacuumlazy.c it seems we need to expose the
declaration of GlobalVisState.
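That is, since PHVShared embeds the struct by value, a plain struct
assignment like the following compiles only when the complete type is
visible (an illustrative line, matching what the patch's DSM
initialization does):

    /* requires the full definition, not just a forward declaration */
    phvstate->shared->vistest = *vacrel->vistest;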
The attached patches address all comments I got so far, including
comments from Peter[1][2]. From the previous version, I've made many
changes, not only to fix bugs but also to improve the parallel vacuum
logic itself and its comments. So some review comments about typos and
clarifying the comments are not addressed where I've removed those
comments themselves.
I'm running some benchmark tests and will share the results.
Feedback is very welcome!
Regards,
[1]: /messages/by-id/CAHut+PtnyLVkgg7BsfXy0ciVeyCBaXNRSSi0h8AVdx9cTL9_ug@mail.gmail.com
[2]: /messages/by-id/CAHut+PsA=9UOFKd52A41DSTgeUreMuuweWHmxsokqLzTMao=Rw@mail.gmail.com
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v5-0006-radixtree.h-Add-RT_NUM_KEY-API-to-get-the-number-.patch
From c95bc0f1241c3196dbe09e3ecc617a450a0c094a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 13 Dec 2024 16:54:46 -0800
Subject: [PATCH v5 6/8] radixtree.h: Add RT_NUM_KEY API to get the number of
keys.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
src/include/lib/radixtree.h | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d5767f31c55..3e36f7577b7 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -126,6 +126,7 @@
* RT_ITERATE_NEXT - Return next key-value pair, if any
* RT_END_ITERATE - End iteration
* RT_MEMORY_USAGE - Get the memory as measured by space in memory context blocks
+ * RT_NUM_KEYS - Get the number of key-value pairs in radix tree
*
* Interface for Shared Memory
* ---------
@@ -197,6 +198,7 @@
#define RT_DELETE RT_MAKE_NAME(delete)
#endif
#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#define RT_NUM_KEYS RT_MAKE_NAME(num_keys)
#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
#define RT_STATS RT_MAKE_NAME(stats)
@@ -313,6 +315,7 @@ RT_SCOPE RT_VALUE_TYPE *RT_ITERATE_NEXT(RT_ITER * iter, uint64 *key_p);
RT_SCOPE void RT_END_ITERATE(RT_ITER * iter);
RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE * tree);
+RT_SCOPE int64 RT_NUM_KEYS(RT_RADIX_TREE * tree);
#ifdef RT_DEBUG
RT_SCOPE void RT_STATS(RT_RADIX_TREE * tree);
@@ -2844,6 +2847,15 @@ RT_MEMORY_USAGE(RT_RADIX_TREE * tree)
return total;
}
+RT_SCOPE int64
+RT_NUM_KEYS(RT_RADIX_TREE * tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ return tree->ctl->num_keys;
+}
+
/*
* Perform some sanity checks on the given node.
*/
@@ -3167,6 +3179,7 @@ RT_DUMP_NODE(RT_NODE * node)
#undef RT_END_ITERATE
#undef RT_DELETE
#undef RT_MEMORY_USAGE
+#undef RT_NUM_KEYS
#undef RT_DUMP_NODE
#undef RT_STATS
--
2.43.5
v5-0005-Support-shared-itereation-on-TidStore.patch
From 7298cb4e3e43ba2355e13e258e244fa62e8d4b13 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 24 Oct 2024 17:34:57 -0700
Subject: [PATCH v5 5/8] Support shared iteration on TidStore.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
src/backend/access/common/tidstore.c | 59 ++++++++++++++++++
src/include/access/tidstore.h | 3 +
.../modules/test_tidstore/test_tidstore.c | 62 ++++++++++++++-----
3 files changed, 110 insertions(+), 14 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index a7179759d67..637d26012d2 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -483,6 +483,7 @@ TidStoreBeginIterate(TidStore *ts)
iter = palloc0(sizeof(TidStoreIter));
iter->ts = ts;
+ /* begin iteration on the radix tree */
if (TidStoreIsShared(ts))
iter->tree_iter.shared = shared_ts_begin_iterate(ts->tree.shared);
else
@@ -533,6 +534,56 @@ TidStoreEndIterate(TidStoreIter *iter)
pfree(iter);
}
+/*
+ * Prepare to iterate through a shared TidStore in shared mode. This function
+ * starts an iteration on the given TidStore that parallel workers can join.
+ *
+ * The TidStoreIter struct is created in the caller's memory context, and it
+ * will be freed in TidStoreEndIterate.
+ *
+ * The caller is responsible for locking TidStore until the iteration is
+ * finished.
+ */
+TidStoreIter *
+TidStoreBeginIterateShared(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ if (!TidStoreIsShared(ts))
+ elog(ERROR, "cannot begin shared iteration on local TidStore");
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+
+ /* begin the shared iteration on radix tree */
+ iter->tree_iter.shared =
+ (shared_ts_iter *) shared_ts_begin_iterate_shared(ts->tree.shared);
+
+ return iter;
+}
+
+/*
+ * Attach to the shared TidStore iterator. 'iter_handle' is the dsa_pointer
+ * returned by TidStoreGetSharedIterHandle(). The returned object is allocated
+ * in backend-local memory using CurrentMemoryContext.
+ */
+TidStoreIter *
+TidStoreAttachIterateShared(TidStore *ts, dsa_pointer iter_handle)
+{
+ TidStoreIter *iter;
+
+ Assert(TidStoreIsShared(ts));
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+
+ /* Attach to the shared iterator */
+ iter->tree_iter.shared = shared_ts_attach_iterate_shared(ts->tree.shared,
+ iter_handle);
+
+ return iter;
+}
+
/*
* Return the memory usage of TidStore.
*/
@@ -564,6 +615,14 @@ TidStoreGetHandle(TidStore *ts)
return (dsa_pointer) shared_ts_get_handle(ts->tree.shared);
}
+dsa_pointer
+TidStoreGetSharedIterHandle(TidStoreIter *iter)
+{
+ Assert(TidStoreIsShared(iter->ts));
+
+ return (dsa_pointer) shared_ts_get_iter_handle(iter->tree_iter.shared);
+}
+
/*
* Given a TidStoreIterResult returned by TidStoreIterateNext(), extract the
* offset numbers. Returns the number of offsets filled in, if <=
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
index aeaf563b6a9..f20c9a92e55 100644
--- a/src/include/access/tidstore.h
+++ b/src/include/access/tidstore.h
@@ -37,6 +37,9 @@ extern void TidStoreDetach(TidStore *ts);
extern void TidStoreLockExclusive(TidStore *ts);
extern void TidStoreLockShare(TidStore *ts);
extern void TidStoreUnlock(TidStore *ts);
+extern TidStoreIter *TidStoreBeginIterateShared(TidStore *ts);
+extern TidStoreIter *TidStoreAttachIterateShared(TidStore *ts, dsa_pointer iter_handle);
+extern dsa_pointer TidStoreGetSharedIterHandle(TidStoreIter *iter);
extern void TidStoreDestroy(TidStore *ts);
extern void TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
int num_offsets);
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
index 25077caf8f1..2dc649124bb 100644
--- a/src/test/modules/test_tidstore/test_tidstore.c
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -33,6 +33,7 @@ PG_FUNCTION_INFO_V1(test_is_full);
PG_FUNCTION_INFO_V1(test_destroy);
static TidStore *tidstore = NULL;
+static bool tidstore_is_shared;
static size_t tidstore_empty_size;
/* array for verification of some tests */
@@ -107,6 +108,7 @@ test_create(PG_FUNCTION_ARGS)
LWLockRegisterTranche(tranche_id, "test_tidstore");
tidstore = TidStoreCreateShared(tidstore_max_size, tranche_id);
+ tidstore_is_shared = true;
/*
* Remain attached until end of backend or explicitly detached so that
@@ -115,8 +117,11 @@ test_create(PG_FUNCTION_ARGS)
dsa_pin_mapping(TidStoreGetDSA(tidstore));
}
else
+ {
/* VACUUM uses insert only, so we test the other option. */
tidstore = TidStoreCreateLocal(tidstore_max_size, false);
+ tidstore_is_shared = false;
+ }
tidstore_empty_size = TidStoreMemoryUsage(tidstore);
@@ -212,14 +217,42 @@ do_set_block_offsets(PG_FUNCTION_ARGS)
PG_RETURN_INT64(blkno);
}
+/* Collect TIDs stored in the tidstore, in order */
+static void
+check_iteration(TidStore *tidstore, int *num_iter_tids, bool shared_iter)
+{
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
+
+ TidStoreLockShare(tidstore);
+
+ if (shared_iter)
+ iter = TidStoreBeginIterateShared(tidstore);
+ else
+ iter = TidStoreBeginIterate(tidstore);
+
+ while ((iter_result = TidStoreIterateNext(iter)) != NULL)
+ {
+ OffsetNumber offsets[MaxOffsetNumber];
+ int num_offsets;
+
+ num_offsets = TidStoreGetBlockOffsets(iter_result, offsets, lengthof(offsets));
+ Assert(num_offsets <= lengthof(offsets));
+ for (int i = 0; i < num_offsets; i++)
+ ItemPointerSet(&(items.iter_tids[(*num_iter_tids)++]), iter_result->blkno,
+ offsets[i]);
+ }
+
+ TidStoreEndIterate(iter);
+ TidStoreUnlock(tidstore);
+}
+
/*
* Verify TIDs in store against the array.
*/
Datum
check_set_block_offsets(PG_FUNCTION_ARGS)
{
- TidStoreIter *iter;
- TidStoreIterResult *iter_result;
int num_iter_tids = 0;
int num_lookup_tids = 0;
BlockNumber prevblkno = 0;
@@ -261,22 +294,23 @@ check_set_block_offsets(PG_FUNCTION_ARGS)
}
/* Collect TIDs stored in the tidstore, in order */
+ check_iteration(tidstore, &num_iter_tids, false);
- TidStoreLockShare(tidstore);
- iter = TidStoreBeginIterate(tidstore);
- while ((iter_result = TidStoreIterateNext(iter)) != NULL)
+ /* If the tidstore is shared, check the shared-iteration as well */
+ if (tidstore_is_shared)
{
- OffsetNumber offsets[MaxOffsetNumber];
- int num_offsets;
+ int num_iter_tids_shared = 0;
- num_offsets = TidStoreGetBlockOffsets(iter_result, offsets, lengthof(offsets));
- Assert(num_offsets <= lengthof(offsets));
- for (int i = 0; i < num_offsets; i++)
- ItemPointerSet(&(items.iter_tids[num_iter_tids++]), iter_result->blkno,
- offsets[i]);
+ check_iteration(tidstore, &num_iter_tids_shared, true);
+
+ /*
+ * Verify that normal iteration and shared iteration returned the
+ * same number of TIDs.
+ */
+ if (num_lookup_tids != num_iter_tids_shared)
+ elog(ERROR, "shared iteration should have %d TIDs, have %d",
+ num_lookup_tids, num_iter_tids_shared);
}
- TidStoreEndIterate(iter);
- TidStoreUnlock(tidstore);
/*
* Sort verification and lookup arrays and test that all arrays are the
--
2.43.5
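To make the intended flow of the new shared-iteration API concrete, here
is a minimal usage sketch (not part of the patch; locking and error
handling reduced to the essentials, and 'ts' is assumed to be a TidStore
created with TidStoreCreateShared() and attached by all participants):

/* Leader: start the shared iteration and publish its handle */
TidStoreIter *iter;
dsa_pointer handle;

TidStoreLockShare(ts);
iter = TidStoreBeginIterateShared(ts);
handle = TidStoreGetSharedIterHandle(iter);
/* ... pass 'handle' to workers through shared memory (DSM) ... */

/*
 * Worker: attach to the same iteration; each TidStoreIterateNext()
 * call hands out a distinct block to whichever backend asks next.
 */
TidStoreIter *witer = TidStoreAttachIterateShared(ts, handle);
TidStoreIterResult *result;

while ((result = TidStoreIterateNext(witer)) != NULL)
{
    /* process result->blkno and its dead item offsets */
}
TidStoreEndIterate(witer);
/* the leader unlocks the TidStore once the iteration is finished */
TidStoreUnlock(ts);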
v5-0008-Support-parallel-heap-vacuum-during-lazy-vacuum.patch
From 8a0e97565e6bcc1b952de0d2b3034a7dce35a62d Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 24 Oct 2024 17:37:45 -0700
Subject: [PATCH v5 8/8] Support parallel heap vacuum during lazy vacuum.
This commit further extends parallel vacuum to perform the heap vacuum
phase with parallel workers. It leverages the shared TidStore iteration.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
doc/src/sgml/ref/vacuum.sgml | 17 +-
src/backend/access/heap/vacuumlazy.c | 280 +++++++++++++++++++-------
src/backend/commands/vacuumparallel.c | 10 +-
src/include/commands/vacuum.h | 2 +-
4 files changed, 223 insertions(+), 86 deletions(-)
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index aae0bbcd577..104157b5a56 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -278,20 +278,21 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
<term><literal>PARALLEL</literal></term>
<listitem>
<para>
- Perform scanning heap, index vacuum, and index cleanup phases of
- <command>VACUUM</command> in parallel using
+ Perform scanning heap, vacuuming heap, index vacuum, and index cleanup
+ phases of <command>VACUUM</command> in parallel using
<replaceable class="parameter">integer</replaceable> background workers
(for the details of each vacuum phase, please refer to
<xref linkend="vacuum-phases"/>).
</para>
<para>
For heap tables, the number of workers used to perform the scanning
- heap is determined based on the size of table. A table can participate in
- parallel scanning heap if and only if the size of the table is more than
- <xref linkend="guc-min-parallel-table-scan-size"/>. During scanning heap,
- the heap table's blocks will be divided into ranges and shared among the
- cooperating processes. Each worker process will complete the scanning of
- its given range of blocks before requesting an additional range of blocks.
+   heap and vacuuming heap is determined based on the size of the table. A table
+ can participate in parallel scanning heap if and only if the size of the
+ table is more than <xref linkend="guc-min-parallel-table-scan-size"/>.
+ During scanning heap, the heap table's blocks will be divided into ranges
+ and shared among the cooperating processes. Each worker process will
+ complete the scanning of its given range of blocks before requesting an
+ additional range of blocks.
</para>
<para>
The number of workers used to perform parallel index vacuum and index
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 2e70bc68d2c..67516391d89 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -20,6 +20,41 @@
* that there only needs to be one call to lazy_vacuum, after the initial pass
* completes.
*
+ * Parallel Vacuum
+ * ----------------
+ * Lazy vacuum on heap tables supports parallel processing for three vacuum
+ * phases: scanning heap, vacuuming indexes, and vacuuming heap. Before the
+ * scanning heap phase, we initialize parallel vacuum state, ParallelVacuumState,
+ * and allocate the TID store in a DSA area if we can use parallel mode for any
+ * of these three phases.
+ *
+ * We may require a different number of parallel vacuum workers for each phase
+ * depending on various factors such as table size, number of indexes, and the
+ * number of pages having dead tuples. Parallel workers are launched at the
+ * beginning of each phase and exit at the end of each phase.
+ *
+ * For scanning the heap table with parallel workers, we utilize the
+ * table_block_parallelscan_xxx facility, which splits the table into several
+ * chunks that parallel workers allocate and scan. If the dead_items TID store
+ * is close to overrunning the available space during the parallel heap scan,
+ * parallel workers exit and the leader process gathers the scan results. Then,
+ * it performs a round of index and heap vacuuming that can also use
+ * parallelism. After vacuuming both indexes and heap table, the leader process
+ * vacuums the FSM to make newly-freed space visible. Then, it relaunches
+ * parallel workers to resume the scanning heap phase. In order to be able to
+ * resume the parallel heap scan from the previous state, the workers' parallel
+ * scan descriptions are stored in the shared memory (DSM) space shared among
+ * parallel workers. If the leader launches fewer workers when resuming the
+ * parallel heap scan, some blocks remain unscanned. The leader process deals
+ * with such blocks serially at the end of the scanning heap phase (see
+ * parallel_heap_complete_unfinished_scan()).
+ *
+ * At the beginning of the vacuuming heap phase, the leader launches parallel
+ * workers and initiates the shared iteration on the shared TID store. At the
+ * end of the phase, the leader process waits for all workers to finish and
+ * gathers the workers' results.
+ *
+ *
* Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
@@ -172,6 +207,7 @@ typedef struct LVRelScanState
BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
+ BlockNumber vacuumed_pages; /* # pages vacuumed in one second-pass round */
/* Counters that follow are only for scanned_pages */
int64 tuples_deleted; /* # deleted from table */
@@ -205,11 +241,15 @@ typedef struct PHVShared
* The final value is OR of worker's skippedallvis.
*/
bool skippedallvis;
+ bool do_index_vacuuming;
/* VACUUM operation's cutoffs for freezing and pruning */
struct VacuumCutoffs cutoffs;
GlobalVisState vistest;
+ dsa_pointer shared_iter_handle;
+ bool do_heap_vacuum;
+
/* per-worker scan stats for parallel heap vacuum scan */
LVRelScanState worker_scan_state[FLEXIBLE_ARRAY_MEMBER];
} PHVShared;
@@ -257,6 +297,14 @@ typedef struct PHVState
/* Assigned per-worker scan state */
PHVScanWorkerState *myscanstate;
+ /*
+ * The number of parallel workers to launch for parallel heap scanning.
+ * Note that the number of parallel workers for parallel heap vacuuming
+ * could vary but is never more than num_heapscan_workers. So this also works as
+ * the maximum number of workers for parallel heap scanning and vacuuming.
+ */
+ int num_heapscan_workers;
+
/*
* All blocks up to this value has been scanned, i.e. the minimum of all
* PHVScanWorkerState->last_blkno. This field is updated by
@@ -374,6 +422,7 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
+static void do_lazy_vacuum_heap_rel(LVRelState *vacrel, TidStoreIter *iter);
static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
Buffer buffer, OffsetNumber *deadoffsets,
int num_offsets, Buffer vmbuffer);
@@ -404,6 +453,7 @@ static void do_parallel_lazy_scan_heap(LVRelState *vacrel);
static void parallel_heap_vacuum_compute_min_scanned_blkno(LVRelState *vacrel);
static void parallel_heap_vacuum_gather_scan_results(LVRelState *vacrel);
static void parallel_heap_complete_unfinished_scan(LVRelState *vacrel);
+static int compute_heap_vacuum_parallel_workers(Relation rel, BlockNumber nblocks);
static void vacuum_error_callback(void *arg);
static void update_vacuum_error_info(LVRelState *vacrel,
@@ -551,6 +601,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
scan_state->lpdead_item_pages = 0;
scan_state->missed_dead_pages = 0;
scan_state->nonempty_pages = 0;
+ scan_state->vacuumed_pages = 0;
scan_state->tuples_deleted = 0;
scan_state->tuples_frozen = 0;
scan_state->lpdead_items = 0;
@@ -2456,46 +2507,14 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
return allindexes;
}
-/*
- * lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy
- *
- * This routine marks LP_DEAD items in vacrel->dead_items as LP_UNUSED. Pages
- * that never had lazy_scan_prune record LP_DEAD items are not visited at all.
- *
- * We may also be able to truncate the line pointer array of the heap pages we
- * visit. If there is a contiguous group of LP_UNUSED items at the end of the
- * array, it can be reclaimed as free space. These LP_UNUSED items usually
- * start out as LP_DEAD items recorded by lazy_scan_prune (we set items from
- * each page to LP_UNUSED, and then consider if it's possible to truncate the
- * page's line pointer array).
- *
- * Note: the reason for doing this as a second pass is we cannot remove the
- * tuples until we've removed their index entries, and we want to process
- * index entry removal in batches as large as possible.
- */
static void
-lazy_vacuum_heap_rel(LVRelState *vacrel)
+do_lazy_vacuum_heap_rel(LVRelState *vacrel, TidStoreIter *iter)
{
- BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
- LVSavedErrInfo saved_err_info;
- TidStoreIter *iter;
- TidStoreIterResult *iter_result;
-
- Assert(vacrel->do_index_vacuuming);
- Assert(vacrel->do_index_cleanup);
- Assert(vacrel->num_index_scans > 0);
-
- /* Report that we are now vacuuming the heap */
- pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
- PROGRESS_VACUUM_PHASE_VACUUM_HEAP);
- /* Update error traceback information */
- update_vacuum_error_info(vacrel, &saved_err_info,
- VACUUM_ERRCB_PHASE_VACUUM_HEAP,
- InvalidBlockNumber, InvalidOffsetNumber);
+ /* LVSavedErrInfo saved_err_info; */
+ TidStoreIterResult *iter_result;
- iter = TidStoreBeginIterate(vacrel->dead_items);
while ((iter_result = TidStoreIterateNext(iter)) != NULL)
{
BlockNumber blkno;
@@ -2533,26 +2552,106 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
UnlockReleaseBuffer(buf);
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
- vacuumed_pages++;
+ vacrel->scan_state->vacuumed_pages++;
}
- TidStoreEndIterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
ReleaseBuffer(vmbuffer);
+}
+
+/*
+ * lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy
+ *
+ * This routine marks LP_DEAD items in vacrel->dead_items as LP_UNUSED. Pages
+ * that never had lazy_scan_prune record LP_DEAD items are not visited at all.
+ *
+ * We may also be able to truncate the line pointer array of the heap pages we
+ * visit. If there is a contiguous group of LP_UNUSED items at the end of the
+ * array, it can be reclaimed as free space. These LP_UNUSED items usually
+ * start out as LP_DEAD items recorded by lazy_scan_prune (we set items from
+ * each page to LP_UNUSED, and then consider if it's possible to truncate the
+ * page's line pointer array).
+ *
+ * Note: the reason for doing this as a second pass is we cannot remove the
+ * tuples until we've removed their index entries, and we want to process
+ * index entry removal in batches as large as possible.
+ */
+static void
+lazy_vacuum_heap_rel(LVRelState *vacrel)
+{
+ LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ int nworkers = 0;
+
+ Assert(vacrel->do_index_vacuuming);
+ Assert(vacrel->do_index_cleanup);
+ Assert(vacrel->num_index_scans > 0);
+
+ /* Report that we are now vacuuming the heap */
+ pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+ PROGRESS_VACUUM_PHASE_VACUUM_HEAP);
+
+ /* Update error traceback information */
+ update_vacuum_error_info(vacrel, &saved_err_info,
+ VACUUM_ERRCB_PHASE_VACUUM_HEAP,
+ InvalidBlockNumber, InvalidOffsetNumber);
+
+ vacrel->scan_state->vacuumed_pages = 0;
+
+ /* Compute parallel workers required to scan blocks to vacuum */
+ if (ParallelHeapVacuumIsActive(vacrel))
+ nworkers = compute_heap_vacuum_parallel_workers(vacrel->rel,
+ TidStoreNumBlocks(vacrel->dead_items));
+
+ if (nworkers > 0)
+ {
+ PHVState *phvstate = vacrel->phvstate;
+
+ iter = TidStoreBeginIterateShared(vacrel->dead_items);
+
+ /* launch workers */
+ phvstate->shared->do_heap_vacuum = true;
+ phvstate->shared->shared_iter_handle = TidStoreGetSharedIterHandle(iter);
+ phvstate->nworkers_launched = parallel_vacuum_table_scan_begin(vacrel->pvs,
+ nworkers);
+ }
+ else
+ iter = TidStoreBeginIterate(vacrel->dead_items);
+
+ /* do the real work */
+ do_lazy_vacuum_heap_rel(vacrel, iter);
+
+ if (ParallelHeapVacuumIsActive(vacrel) && nworkers > 0)
+ {
+ PHVState *phvstate = vacrel->phvstate;
+
+ parallel_vacuum_table_scan_end(vacrel->pvs);
+
+ /* Gather the heap vacuum statistics that workers collected */
+ for (int i = 0; i < phvstate->nworkers_launched; i++)
+ {
+ LVRelScanState *ss = &(phvstate->shared->worker_scan_state[i]);
+
+ vacrel->scan_state->vacuumed_pages += ss->vacuumed_pages;
+ }
+ }
+
+ TidStoreEndIterate(iter);
+
/*
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
Assert(vacrel->num_index_scans > 1 ||
(vacrel->dead_items_info->num_items == vacrel->scan_state->lpdead_items &&
- vacuumed_pages == vacrel->scan_state->lpdead_item_pages));
+ vacrel->scan_state->vacuumed_pages == vacrel->scan_state->lpdead_item_pages));
ereport(DEBUG2,
(errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
vacrel->relname, (long long) vacrel->dead_items_info->num_items,
- vacuumed_pages)));
+ vacrel->scan_state->vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
@@ -3261,6 +3360,11 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
{
vacrel->dead_items = parallel_vacuum_get_dead_items(vacrel->pvs,
&vacrel->dead_items_info);
+
+ if (ParallelHeapVacuumIsActive(vacrel))
+ vacrel->phvstate->num_heapscan_workers =
+ parallel_vacuum_get_nworkers_table(vacrel->pvs);
+
return;
}
}
@@ -3508,37 +3612,41 @@ update_relstats_all_indexes(LVRelState *vacrel)
*
* The calculation logic is borrowed from compute_parallel_worker().
*/
-int
-heap_parallel_vacuum_compute_workers(Relation rel, int nrequested)
+static int
+compute_heap_vacuum_parallel_workers(Relation rel, BlockNumber nblocks)
{
int parallel_workers = 0;
int heap_parallel_threshold;
int heap_pages;
- if (nrequested == 0)
+ /*
+ * Select the number of workers based on the log of the size of the
+ * relation. Note that the upper limit of the min_parallel_table_scan_size
+ * GUC is chosen to prevent overflow here.
+ */
+ heap_parallel_threshold = Max(min_parallel_table_scan_size, 1);
+ heap_pages = BlockNumberIsValid(nblocks) ?
+ nblocks : RelationGetNumberOfBlocks(rel);
+ while (heap_pages >= (BlockNumber) (heap_parallel_threshold * 3))
{
- /*
- * Select the number of workers based on the log of the size of the
- * relation. Note that the upper limit of the
- * min_parallel_table_scan_size GUC is chosen to prevent overflow
- * here.
- */
- heap_parallel_threshold = Max(min_parallel_table_scan_size, 1);
- heap_pages = RelationGetNumberOfBlocks(rel);
- while (heap_pages >= (BlockNumber) (heap_parallel_threshold * 3))
- {
- parallel_workers++;
- heap_parallel_threshold *= 3;
- if (heap_parallel_threshold > INT_MAX / 3)
- break;
- }
+ parallel_workers++;
+ heap_parallel_threshold *= 3;
+ if (heap_parallel_threshold > INT_MAX / 3)
+ break;
}
- else
- parallel_workers = nrequested;
return parallel_workers;
}
+int
+heap_parallel_vacuum_compute_workers(Relation rel, int nrequested)
+{
+ if (nrequested == 0)
+ return compute_heap_vacuum_parallel_workers(rel, InvalidBlockNumber);
+ else
+ return nrequested;
+}
+
/* Estimate shared memory sizes required for parallel heap vacuum */
static inline void
heap_parallel_estimate_shared_memory_size(Relation rel, int nworkers, Size *pscan_len,
@@ -3620,6 +3728,7 @@ heap_parallel_vacuum_initialize(Relation rel, ParallelContext *pcxt,
shared->NewRelfrozenXid = vacrel->scan_state->NewRelfrozenXid;
shared->NewRelminMxid = vacrel->scan_state->NewRelminMxid;
shared->skippedallvis = vacrel->scan_state->skippedallvis;
+ shared->do_index_vacuuming = vacrel->do_index_vacuuming;
/*
* XXX: we copy the contents of vistest to the shared area, but in order
@@ -3672,7 +3781,6 @@ heap_parallel_vacuum_worker(Relation rel, ParallelVacuumState *pvs,
PHVScanWorkerState *scanstate;
LVRelScanState *scan_state;
ErrorContextCallback errcallback;
- bool scan_done;
phvstate = palloc(sizeof(PHVState));
@@ -3709,10 +3817,11 @@ heap_parallel_vacuum_worker(Relation rel, ParallelVacuumState *pvs,
/* initialize per-worker relation statistics */
MemSet(scan_state, 0, sizeof(LVRelScanState));
- /* Set fields necessary for heap scan */
+ /* Set fields necessary for heap scan and vacuum */
vacrel.scan_state->NewRelfrozenXid = shared->NewRelfrozenXid;
vacrel.scan_state->NewRelminMxid = shared->NewRelminMxid;
vacrel.scan_state->skippedallvis = shared->skippedallvis;
+ vacrel.do_index_vacuuming = shared->do_index_vacuuming;
/* Initialize the per-worker scan state if not yet */
if (!phvstate->myscanstate->initialized)
@@ -3734,25 +3843,44 @@ heap_parallel_vacuum_worker(Relation rel, ParallelVacuumState *pvs,
vacrel.relnamespace = get_database_name(RelationGetNamespace(rel));
vacrel.relname = pstrdup(RelationGetRelationName(rel));
vacrel.indname = NULL;
- vacrel.phase = VACUUM_ERRCB_PHASE_SCAN_HEAP;
errcallback.callback = vacuum_error_callback;
errcallback.arg = &vacrel;
errcallback.previous = error_context_stack;
error_context_stack = &errcallback;
- scan_done = do_lazy_scan_heap(&vacrel);
+ if (shared->do_heap_vacuum)
+ {
+ TidStoreIter *iter;
+
+ iter = TidStoreAttachIterateShared(vacrel.dead_items, shared->shared_iter_handle);
+
+ /* Join parallel heap vacuum */
+ vacrel.phase = VACUUM_ERRCB_PHASE_VACUUM_HEAP;
+ do_lazy_vacuum_heap_rel(&vacrel, iter);
+
+ TidStoreEndIterate(iter);
+ }
+ else
+ {
+ bool scan_done;
+
+ /* Join parallel heap scan */
+ vacrel.phase = VACUUM_ERRCB_PHASE_SCAN_HEAP;
+ scan_done = do_lazy_scan_heap(&vacrel);
+
+ /*
+ * If the leader or a worker finishes the heap scan because the
+ * dead_items TID store is close to the limit, it might have some
+ * allocated blocks in its scan state. Since this scan state might not
+ * be used in the next heap scan, we remember that it might have some
+ * unconsumed blocks so that the leader can complete the scans after
+ * the heap scan phase finishes.
+ */
+ phvstate->myscanstate->maybe_have_blocks = !scan_done;
+ }
/* Pop the error context stack */
error_context_stack = errcallback.previous;
-
- /*
- * If the leader or a worker finishes the heap scan because dead_items
- * TIDs is close to the limit, it might have some allocated blocks in its
- * scan state. Since this scan state might not be used in the next heap
- * scan, we remember that it might have some unconsumed blocks so that the
- * leader complete the scans after the heap scan phase finishes.
- */
- phvstate->myscanstate->maybe_have_blocks = !scan_done;
}
/*
@@ -3874,7 +4002,10 @@ do_parallel_lazy_scan_heap(LVRelState *vacrel)
Assert(!IsParallelWorker());
/* launcher workers */
- vacrel->phvstate->nworkers_launched = parallel_vacuum_table_scan_begin(vacrel->pvs);
+ vacrel->phvstate->shared->do_heap_vacuum = false;
+ vacrel->phvstate->nworkers_launched =
+ parallel_vacuum_table_scan_begin(vacrel->pvs,
+ vacrel->phvstate->num_heapscan_workers);
/* initialize parallel scan description to join as a worker */
scanstate = palloc0(sizeof(PHVScanWorkerState));
@@ -3933,7 +4064,8 @@ do_parallel_lazy_scan_heap(LVRelState *vacrel)
/* Re-launch workers to restart parallel heap scan */
vacrel->phvstate->nworkers_launched =
- parallel_vacuum_table_scan_begin(vacrel->pvs);
+ parallel_vacuum_table_scan_begin(vacrel->pvs,
+ vacrel->phvstate->num_heapscan_workers);
}
/*
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 3001be84ddf..fd897ddadf3 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -1054,8 +1054,10 @@ parallel_vacuum_index_is_parallel_safe(Relation indrel, int num_index_scans,
* table vacuum.
*/
int
-parallel_vacuum_table_scan_begin(ParallelVacuumState *pvs)
+parallel_vacuum_table_scan_begin(ParallelVacuumState *pvs, int nworkers_request)
{
+ int nworkers;
+
Assert(!IsParallelWorker());
if (pvs->shared->nworkers_for_table == 0)
@@ -1069,11 +1071,13 @@ parallel_vacuum_table_scan_begin(ParallelVacuumState *pvs)
if (pvs->num_table_scans > 0)
ReinitializeParallelDSM(pvs->pcxt);
+ nworkers = Min(nworkers_request, pvs->shared->nworkers_for_table);
+
/*
* The number of workers might vary between table vacuum and index
* processing
*/
- ReinitializeParallelWorkers(pvs->pcxt, pvs->shared->nworkers_for_table);
+ ReinitializeParallelWorkers(pvs->pcxt, nworkers);
LaunchParallelWorkers(pvs->pcxt);
if (pvs->pcxt->nworkers_launched > 0)
@@ -1097,7 +1101,7 @@ parallel_vacuum_table_scan_begin(ParallelVacuumState *pvs)
(errmsg(ngettext("launched %d parallel vacuum worker for table processing (planned: %d)",
"launched %d parallel vacuum workers for table processing (planned: %d)",
pvs->pcxt->nworkers_launched),
- pvs->pcxt->nworkers_launched, pvs->shared->nworkers_for_table)));
+ pvs->pcxt->nworkers_launched, nworkers)));
return pvs->pcxt->nworkers_launched;
}
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index b70e50133fa..ab6b6cde759 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -371,7 +371,7 @@ extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
extern void parallel_vacuum_cleanup_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
bool estimated_count);
-extern int parallel_vacuum_table_scan_begin(ParallelVacuumState *pvs);
+extern int parallel_vacuum_table_scan_begin(ParallelVacuumState *pvs, int nworkers_request);
extern void parallel_vacuum_table_scan_end(ParallelVacuumState *pvs);
extern int parallel_vacuum_get_nworkers_table(ParallelVacuumState *pvs);
extern int parallel_vacuum_get_nworkers_index(ParallelVacuumState *pvs);
--
2.43.5
v5-0004-raidxtree.h-support-shared-iteration.patch
From 8d4f5c162f19b080f117294c9f089a95cb731a99 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 24 Oct 2024 17:29:51 -0700
Subject: [PATCH v5 4/8] radixtree.h: support shared iteration.
This commit supports a shared iteration operation on a radix tree with
multiple processes. The radix tree must be in shared mode to start a
shared iteration. Parallel workers can attach to the shared iteration
using the iterator handle given by the leader process. As with a
normal iteration, it's guaranteed that the shared iteration returns
key-values in ascending order.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
---
src/include/lib/radixtree.h | 227 +++++++++++++++---
.../modules/test_radixtree/test_radixtree.c | 128 ++++++----
2 files changed, 281 insertions(+), 74 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 1301f3fee44..d5767f31c55 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -136,6 +136,9 @@
* RT_LOCK_SHARE - Lock the radix tree in share mode
* RT_UNLOCK - Unlock the radix tree
* RT_GET_HANDLE - Return the handle of the radix tree
+ * RT_BEGIN_ITERATE_SHARED - Begin iterating in shared mode.
+ * RT_ATTACH_ITERATE_SHARED - Attach to the shared iterator.
+ * RT_GET_ITER_HANDLE - Get the handle of the shared iterator.
*
* Optional Interface
* ---------
@@ -179,6 +182,9 @@
#define RT_ATTACH RT_MAKE_NAME(attach)
#define RT_DETACH RT_MAKE_NAME(detach)
#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#define RT_BEGIN_ITERATE_SHARED RT_MAKE_NAME(begin_iterate_shared)
+#define RT_ATTACH_ITERATE_SHARED RT_MAKE_NAME(attach_iterate_shared)
+#define RT_GET_ITER_HANDLE RT_MAKE_NAME(get_iter_handle)
#define RT_LOCK_EXCLUSIVE RT_MAKE_NAME(lock_exclusive)
#define RT_LOCK_SHARE RT_MAKE_NAME(lock_share)
#define RT_UNLOCK RT_MAKE_NAME(unlock)
@@ -238,15 +244,19 @@
#define RT_SHRINK_NODE_16 RT_MAKE_NAME(shrink_child_16)
#define RT_SHRINK_NODE_48 RT_MAKE_NAME(shrink_child_48)
#define RT_SHRINK_NODE_256 RT_MAKE_NAME(shrink_child_256)
+#define RT_INITIALIZE_ITER RT_MAKE_NAME(initialize_iter)
#define RT_NODE_ITERATE_NEXT RT_MAKE_NAME(node_iterate_next)
#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
/* type declarations */
#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER_CONTROL RT_MAKE_NAME(iter_control)
#define RT_ITER RT_MAKE_NAME(iter)
#ifdef RT_SHMEM
#define RT_HANDLE RT_MAKE_NAME(handle)
+#define RT_ITER_CONTROL_SHARED RT_MAKE_NAME(iter_control_shared)
+#define RT_ITER_HANDLE RT_MAKE_NAME(iter_handle)
#endif
#define RT_NODE RT_MAKE_NAME(node)
#define RT_CHILD_PTR RT_MAKE_NAME(child_ptr)
@@ -272,6 +282,7 @@ typedef struct RT_ITER RT_ITER;
#ifdef RT_SHMEM
typedef dsa_pointer RT_HANDLE;
+typedef dsa_pointer RT_ITER_HANDLE;
#endif
#ifdef RT_SHMEM
@@ -282,6 +293,9 @@ RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE * tree);
RT_SCOPE void RT_LOCK_EXCLUSIVE(RT_RADIX_TREE * tree);
RT_SCOPE void RT_LOCK_SHARE(RT_RADIX_TREE * tree);
RT_SCOPE void RT_UNLOCK(RT_RADIX_TREE * tree);
+RT_SCOPE RT_ITER *RT_BEGIN_ITERATE_SHARED(RT_RADIX_TREE * tree);
+RT_SCOPE RT_ITER_HANDLE RT_GET_ITER_HANDLE(RT_ITER * iter);
+RT_SCOPE RT_ITER *RT_ATTACH_ITERATE_SHARED(RT_RADIX_TREE * tree, RT_ITER_HANDLE handle);
#else
RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx);
#endif
@@ -689,6 +703,7 @@ typedef struct RT_RADIX_TREE_CONTROL
RT_HANDLE handle;
uint32 magic;
LWLock lock;
+ int tranche_id;
#endif
RT_PTR_ALLOC root;
@@ -742,11 +757,9 @@ typedef struct RT_NODE_ITER
int idx;
} RT_NODE_ITER;
-/* state for iterating over the whole radix tree */
-struct RT_ITER
+/* Contains the iteration state data */
+typedef struct RT_ITER_CONTROL
{
- RT_RADIX_TREE *tree;
-
/*
* A stack to track iteration for each level. Level 0 is the lowest (or
* leaf) level
@@ -757,8 +770,36 @@ struct RT_ITER
/* The key constructed during iteration */
uint64 key;
-};
+} RT_ITER_CONTROL;
+
+#ifdef RT_SHMEM
+/* Contains the shared iteration state data */
+typedef struct RT_ITER_CONTROL_SHARED
+{
+ /* Actual shared iteration state data */
+ RT_ITER_CONTROL common;
+
+ /* protect the control data */
+ LWLock lock;
+
+ RT_ITER_HANDLE handle;
+ pg_atomic_uint32 refcnt;
+} RT_ITER_CONTROL_SHARED;
+#endif
+
+/* state for iterating over the whole radix tree */
+struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+ /* pointing to either local memory or DSA */
+ RT_ITER_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ /* True if the iterator is for shared iteration */
+ bool shared;
+#endif
+};
/* verification (available only in assert-enabled builds) */
static void RT_VERIFY_NODE(RT_NODE * node);
@@ -1850,6 +1891,7 @@ RT_CREATE(MemoryContext ctx)
tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
tree->ctl->handle = dp;
tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+ tree->ctl->tranche_id = tranche_id;
LWLockInitialize(&tree->ctl->lock, tranche_id);
#else
tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
@@ -1902,6 +1944,9 @@ RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
dsa_pointer control;
tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->iter_context = AllocSetContextCreate(CurrentMemoryContext,
+ RT_STR(RT_PREFIX) "_radix_tree iter context",
+ ALLOCSET_SMALL_SIZES);
/* Find the control object in shared memory */
control = handle;
@@ -2074,35 +2119,86 @@ RT_FREE(RT_RADIX_TREE * tree)
/***************** ITERATION *****************/
+/* Common routine to initialize the given iterator */
+static void
+RT_INITIALIZE_ITER(RT_RADIX_TREE * tree, RT_ITER * iter)
+{
+ RT_CHILD_PTR root;
+
+ iter->tree = tree;
+
+ Assert(RT_PTR_ALLOC_IS_VALID(tree->ctl->root));
+ root.alloc = iter->tree->ctl->root;
+ RT_PTR_SET_LOCAL(tree, &root);
+
+ iter->ctl->top_level = iter->tree->ctl->start_shift / RT_SPAN;
+
+ /* Set the root to start */
+ iter->ctl->cur_level = iter->ctl->top_level;
+ iter->ctl->node_iters[iter->ctl->cur_level].node = root;
+ iter->ctl->node_iters[iter->ctl->cur_level].idx = 0;
+}
+
/*
* Create and return the iterator for the given radix tree.
*
- * Taking a lock in shared mode during the iteration is the caller's
- * responsibility.
+ * Taking a lock on a radix tree in shared mode during the iteration is the
+ * caller's responsibility.
*/
RT_SCOPE RT_ITER *
RT_BEGIN_ITERATE(RT_RADIX_TREE * tree)
{
RT_ITER *iter;
- RT_CHILD_PTR root;
iter = (RT_ITER *) MemoryContextAllocZero(tree->iter_context,
sizeof(RT_ITER));
- iter->tree = tree;
+ iter->ctl = (RT_ITER_CONTROL *) MemoryContextAllocZero(tree->iter_context,
+ sizeof(RT_ITER_CONTROL));
- Assert(RT_PTR_ALLOC_IS_VALID(tree->ctl->root));
- root.alloc = iter->tree->ctl->root;
- RT_PTR_SET_LOCAL(tree, &root);
+ RT_INITIALIZE_ITER(tree, iter);
- iter->top_level = iter->tree->ctl->start_shift / RT_SPAN;
+#ifdef RT_SHMEM
+ /* this is a non-shared iteration, even on a shared radix tree */
+ iter->shared = false;
+#endif
- /* Set the root to start */
- iter->cur_level = iter->top_level;
- iter->node_iters[iter->cur_level].node = root;
- iter->node_iters[iter->cur_level].idx = 0;
+ return iter;
+}
+
+#ifdef RT_SHMEM
+/*
+ * Create and return the shared iterator for the given shared radix tree.
+ *
+ * Taking a lock on a radix tree in shared mode during the shared iteration to
+ * prevent concurrent writes is the caller's responsibility.
+ */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE_SHARED(RT_RADIX_TREE * tree)
+{
+ RT_ITER *iter;
+ RT_ITER_CONTROL_SHARED *ctl_shared;
+ dsa_pointer dp;
+
+ /* The radix tree must be in shared mode */
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ dp = dsa_allocate0(tree->dsa, sizeof(RT_ITER_CONTROL_SHARED));
+ ctl_shared = (RT_ITER_CONTROL_SHARED *) dsa_get_address(tree->dsa, dp);
+ ctl_shared->handle = dp;
+ LWLockInitialize(&ctl_shared->lock, tree->ctl->tranche_id);
+ pg_atomic_init_u32(&ctl_shared->refcnt, 1);
+
+ iter = (RT_ITER *) MemoryContextAllocZero(tree->iter_context,
+ sizeof(RT_ITER));
+
+ iter->ctl = (RT_ITER_CONTROL *) ctl_shared;
+ iter->shared = true;
+
+ RT_INITIALIZE_ITER(tree, iter);
return iter;
}
+#endif
/*
* Scan the inner node and return the next child pointer if one exists, otherwise
@@ -2116,12 +2212,18 @@ RT_NODE_ITERATE_NEXT(RT_ITER * iter, int level)
RT_CHILD_PTR node;
RT_PTR_ALLOC *slot = NULL;
+ node_iter = &(iter->ctl->node_iters[level]);
+ node = node_iter->node;
+
#ifdef RT_SHMEM
- Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
-#endif
- node_iter = &(iter->node_iters[level]);
- node = node_iter->node;
+ /*
+ * Since the iterator is shared, the node's local pointer might have been
+ * set by another backend, so make sure to use our own local pointer.
+ */
+ if (iter->shared)
+ RT_PTR_SET_LOCAL(iter->tree, &node);
+#endif
Assert(node.local != NULL);
@@ -2194,8 +2296,8 @@ RT_NODE_ITERATE_NEXT(RT_ITER * iter, int level)
}
/* Update the key */
- iter->key &= ~(((uint64) RT_CHUNK_MASK) << (level * RT_SPAN));
- iter->key |= (((uint64) key_chunk) << (level * RT_SPAN));
+ iter->ctl->key &= ~(((uint64) RT_CHUNK_MASK) << (level * RT_SPAN));
+ iter->ctl->key |= (((uint64) key_chunk) << (level * RT_SPAN));
return slot;
}
@@ -2209,18 +2311,29 @@ RT_ITERATE_NEXT(RT_ITER * iter, uint64 *key_p)
{
RT_PTR_ALLOC *slot = NULL;
- while (iter->cur_level <= iter->top_level)
+#ifdef RT_SHMEM
+ /* Prevent the shared iterator from being updated concurrently */
+ if (iter->shared)
+ LWLockAcquire(&((RT_ITER_CONTROL_SHARED *) iter->ctl)->lock, LW_EXCLUSIVE);
+#endif
+
+ while (iter->ctl->cur_level <= iter->ctl->top_level)
{
RT_CHILD_PTR node;
- slot = RT_NODE_ITERATE_NEXT(iter, iter->cur_level);
+ slot = RT_NODE_ITERATE_NEXT(iter, iter->ctl->cur_level);
- if (iter->cur_level == 0 && slot != NULL)
+ if (iter->ctl->cur_level == 0 && slot != NULL)
{
/* Found a value at the leaf node */
- *key_p = iter->key;
+ *key_p = iter->ctl->key;
node.alloc = *slot;
+#ifdef RT_SHMEM
+ if (iter->shared)
+ LWLockRelease(&((RT_ITER_CONTROL_SHARED *) iter->ctl)->lock);
+#endif
+
if (RT_CHILDPTR_IS_VALUE(*slot))
return (RT_VALUE_TYPE *) slot;
else
@@ -2236,17 +2349,23 @@ RT_ITERATE_NEXT(RT_ITER * iter, uint64 *key_p)
node.alloc = *slot;
RT_PTR_SET_LOCAL(iter->tree, &node);
- iter->cur_level--;
- iter->node_iters[iter->cur_level].node = node;
- iter->node_iters[iter->cur_level].idx = 0;
+ iter->ctl->cur_level--;
+ iter->ctl->node_iters[iter->ctl->cur_level].node = node;
+ iter->ctl->node_iters[iter->ctl->cur_level].idx = 0;
}
else
{
/* Not found the child slot, move up the tree */
- iter->cur_level++;
+ iter->ctl->cur_level++;
}
+
}
+#ifdef RT_SHMEM
+ if (iter->shared)
+ LWLockRelease(&((RT_ITER_CONTROL_SHARED *) iter->ctl)->lock);
+#endif
+
/* We've visited all nodes, so the iteration finished */
return NULL;
}
@@ -2257,9 +2376,45 @@ RT_ITERATE_NEXT(RT_ITER * iter, uint64 *key_p)
RT_SCOPE void
RT_END_ITERATE(RT_ITER * iter)
{
+#ifdef RT_SHMEM
+ RT_ITER_CONTROL_SHARED *ctl = (RT_ITER_CONTROL_SHARED *) iter->ctl;
+
+ if (iter->shared &&
+ pg_atomic_sub_fetch_u32(&ctl->refcnt, 1) == 0)
+ dsa_free(iter->tree->dsa, ctl->handle);
+#endif
pfree(iter);
}
+#ifdef RT_SHMEM
+RT_SCOPE RT_ITER_HANDLE
+RT_GET_ITER_HANDLE(RT_ITER * iter)
+{
+ Assert(iter->shared);
+ return ((RT_ITER_CONTROL_SHARED *) iter->ctl)->handle;
+
+}
+
+RT_SCOPE RT_ITER *
+RT_ATTACH_ITERATE_SHARED(RT_RADIX_TREE * tree, RT_ITER_HANDLE handle)
+{
+ RT_ITER *iter;
+ RT_ITER_CONTROL_SHARED *ctl;
+
+ iter = (RT_ITER *) MemoryContextAllocZero(tree->iter_context,
+ sizeof(RT_ITER));
+ iter->tree = tree;
+ ctl = (RT_ITER_CONTROL_SHARED *) dsa_get_address(tree->dsa, handle);
+ iter->ctl = (RT_ITER_CONTROL *) ctl;
+ iter->shared = true;
+
+ /* For every iterator, increase the refcnt by 1 */
+ pg_atomic_add_fetch_u32(&ctl->refcnt, 1);
+
+ return iter;
+}
+#endif
+
/***************** DELETION *****************/
#ifdef RT_USE_DELETE
@@ -2959,7 +3114,11 @@ RT_DUMP_NODE(RT_NODE * node)
#undef RT_PTR_ALLOC
#undef RT_INVALID_PTR_ALLOC
#undef RT_HANDLE
+#undef RT_ITER_HANDLE
+#undef RT_ITER_CONTROL
+#undef RT_ITER_HANDLE
#undef RT_ITER
+#undef RT_SHARED_ITER
#undef RT_NODE
#undef RT_NODE_ITER
#undef RT_NODE_KIND_4
@@ -2996,6 +3155,11 @@ RT_DUMP_NODE(RT_NODE * node)
#undef RT_LOCK_SHARE
#undef RT_UNLOCK
#undef RT_GET_HANDLE
+#undef RT_BEGIN_ITERATE_SHARED
+#undef RT_ATTACH_ITERATE_SHARED
+#undef RT_GET_ITER_HANDLE
+#undef RT_ATTACH_ITER
+#undef RT_GET_ITER_HANDLE
#undef RT_FIND
#undef RT_SET
#undef RT_BEGIN_ITERATE
@@ -3052,5 +3216,6 @@ RT_DUMP_NODE(RT_NODE * node)
#undef RT_SHRINK_NODE_256
#undef RT_NODE_DELETE
#undef RT_NODE_INSERT
+#undef RT_INITIALIZE_ITER
#undef RT_NODE_ITERATE_NEXT
#undef RT_VERIFY_NODE
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 3e5aa3720c7..ef9cc6eb507 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -161,13 +161,87 @@ test_empty(void)
#endif
}
+/* Iteration test for test_basic() */
+static void
+test_iterate_basic(rt_radix_tree *radixtree, uint64 *keys, int children,
+ bool asc, bool shared)
+{
+ rt_iter *iter;
+
+#ifdef TEST_SHARED_RT
+ if (!shared)
+ iter = rt_begin_iterate(radixtree);
+ else
+ iter = rt_begin_iterate_shared(radixtree);
+#else
+ iter = rt_begin_iterate(radixtree);
+#endif
+
+ for (int i = 0; i < children; i++)
+ {
+ uint64 expected;
+ uint64 iterkey;
+ TestValueType *iterval;
+
+ /* iteration is ordered by key, so adjust expected value accordingly */
+ if (asc)
+ expected = keys[i];
+ else
+ expected = keys[children - 1 - i];
+
+ iterval = rt_iterate_next(iter, &iterkey);
+
+ EXPECT_TRUE(iterval != NULL);
+ EXPECT_EQ_U64(iterkey, expected);
+ EXPECT_EQ_U64(*iterval, expected);
+ }
+
+ rt_end_iterate(iter);
+}
+
+/* Iteration test for test_random() */
+static void
+test_iterate_random(rt_radix_tree *radixtree, uint64 *keys, int num_keys,
+ bool shared)
+{
+ rt_iter *iter;
+
+#ifdef TEST_SHARED_RT
+ if (!shared)
+ iter = rt_begin_iterate(radixtree);
+ else
+ iter = rt_begin_iterate_shared(radixtree);
+#else
+ iter = rt_begin_iterate(radixtree);
+#endif
+
+ for (int i = 0; i < num_keys; i++)
+ {
+ uint64 expected;
+ uint64 iterkey;
+ TestValueType *iterval;
+
+ /* skip duplicate keys */
+ if (i < num_keys - 1 && keys[i + 1] == keys[i])
+ continue;
+
+ expected = keys[i];
+ iterval = rt_iterate_next(iter, &iterkey);
+
+ EXPECT_TRUE(iterval != NULL);
+ EXPECT_EQ_U64(iterkey, expected);
+ EXPECT_EQ_U64(*iterval, expected);
+ }
+
+ rt_end_iterate(iter);
+}
+
/* Basic set, find, and delete tests */
static void
test_basic(rt_node_class_test_elem *test_info, int shift, bool asc)
{
MemoryContext radixtree_ctx;
rt_radix_tree *radixtree;
- rt_iter *iter;
uint64 *keys;
int children = test_info->nkeys;
#ifdef TEST_SHARED_RT
@@ -250,28 +324,12 @@ test_basic(rt_node_class_test_elem *test_info, int shift, bool asc)
}
/* test that iteration returns the expected keys and values */
- iter = rt_begin_iterate(radixtree);
-
- for (int i = 0; i < children; i++)
- {
- uint64 expected;
- uint64 iterkey;
- TestValueType *iterval;
-
- /* iteration is ordered by key, so adjust expected value accordingly */
- if (asc)
- expected = keys[i];
- else
- expected = keys[children - 1 - i];
-
- iterval = rt_iterate_next(iter, &iterkey);
-
- EXPECT_TRUE(iterval != NULL);
- EXPECT_EQ_U64(iterkey, expected);
- EXPECT_EQ_U64(*iterval, expected);
- }
+ test_iterate_basic(radixtree, keys, children, asc, false);
- rt_end_iterate(iter);
+#ifdef TEST_SHARED_RT
+ /* test shared-iteration as well */
+ test_iterate_basic(radixtree, keys, children, asc, true);
+#endif
/* delete all keys again */
for (int i = 0; i < children; i++)
@@ -302,7 +360,6 @@ test_random(void)
{
MemoryContext radixtree_ctx;
rt_radix_tree *radixtree;
- rt_iter *iter;
pg_prng_state state;
/* limit memory usage by limiting the key space */
@@ -395,27 +452,12 @@ test_random(void)
}
/* test that iteration returns the expected keys and values */
- iter = rt_begin_iterate(radixtree);
-
- for (int i = 0; i < num_keys; i++)
- {
- uint64 expected;
- uint64 iterkey;
- TestValueType *iterval;
+ test_iterate_random(radixtree, keys, num_keys, false);
- /* skip duplicate keys */
- if (i < num_keys - 1 && keys[i + 1] == keys[i])
- continue;
-
- expected = keys[i];
- iterval = rt_iterate_next(iter, &iterkey);
-
- EXPECT_TRUE(iterval != NULL);
- EXPECT_EQ_U64(iterkey, expected);
- EXPECT_EQ_U64(*iterval, expected);
- }
-
- rt_end_iterate(iter);
+#ifdef TEST_SHARED_RT
+ /* test shared-iteration as well */
+ test_iterate_random(radixtree, keys, num_keys, true);
+#endif
/* reset random number generator for deletion */
pg_prng_seed(&state, seed);
--
2.43.5
v5-0007-Add-TidStoreNumBlocks-API-to-get-the-number-of-bl.patch
From 275f367616e608794e9da6869428d55027c55368 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 13 Dec 2024 16:55:52 -0800
Subject: [PATCH v5 7/8] Add TidStoreNumBlocks API to get the number of blocks
in TidStore.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
src/backend/access/common/tidstore.c | 12 ++++++++++++
src/include/access/tidstore.h | 1 +
2 files changed, 13 insertions(+)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 637d26012d2..18d0e855ab2 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -596,6 +596,18 @@ TidStoreMemoryUsage(TidStore *ts)
return local_ts_memory_usage(ts->tree.local);
}
+/*
+ * Return the number of blocks (i.e. entries) in TidStore.
+ */
+BlockNumber
+TidStoreNumBlocks(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ return shared_ts_num_keys(ts->tree.shared);
+ else
+ return local_ts_num_keys(ts->tree.local);
+}
+
/*
* Return the DSA area where the TidStore lives.
*/
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
index f20c9a92e55..1566cb47593 100644
--- a/src/include/access/tidstore.h
+++ b/src/include/access/tidstore.h
@@ -51,6 +51,7 @@ extern int TidStoreGetBlockOffsets(TidStoreIterResult *result,
int max_offsets);
extern void TidStoreEndIterate(TidStoreIter *iter);
extern size_t TidStoreMemoryUsage(TidStore *ts);
+extern BlockNumber TidStoreNumBlocks(TidStore *ts);
extern dsa_pointer TidStoreGetHandle(TidStore *ts);
extern dsa_area *TidStoreGetDSA(TidStore *ts);
--
2.43.5
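For context, the 0008 patch uses this API to size the vacuuming-heap
worker count from the number of blocks that actually contain dead items:

nworkers = compute_heap_vacuum_parallel_workers(vacrel->rel,
                                                TidStoreNumBlocks(vacrel->dead_items));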
v5-0002-Remember-the-number-of-times-parallel-index-vacuu.patch
From eda66e24fbf9357e9c4af262ebb8e2e0d2c46ac0 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 13 Dec 2024 15:54:32 -0800
Subject: [PATCH v5 2/8] Remember the number of times parallel index
vacuuming/cleanup is executed in ParallelVacuumState.
Previously, the caller could pass an arbitrary value for
'num_index_scans' to the parallel index vacuuming and cleanup APIs,
which didn't make much sense: the caller had to carefully count how
many times it executed index vacuuming or cleanup, and otherwise it
would fail to reinitialize the parallel DSM.
This commit changes the parallel vacuum APIs so that
ParallelVacuumState keeps the num_index_scans counter itself and
reinitializes the parallel DSM based on it.
An upcoming patch for parallel table scan will do a similar thing.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
src/backend/access/heap/vacuumlazy.c | 4 +---
src/backend/commands/vacuumparallel.c | 27 +++++++++++++++------------
src/include/commands/vacuum.h | 4 +---
3 files changed, 17 insertions(+), 18 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 05406a0bc5a..61b77af09b1 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2143,8 +2143,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
else
{
/* Outsource everything to parallel variant */
- parallel_vacuum_bulkdel_all_indexes(vacrel->pvs, old_live_tuples,
- vacrel->num_index_scans);
+ parallel_vacuum_bulkdel_all_indexes(vacrel->pvs, old_live_tuples);
/*
* Do a postcheck to consider applying wraparound failsafe now. Note
@@ -2514,7 +2513,6 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
{
/* Outsource everything to parallel variant */
parallel_vacuum_cleanup_all_indexes(vacrel->pvs, reltuples,
- vacrel->num_index_scans,
estimated_count);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 67cba17a564..50dd3d7d14d 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -200,6 +200,9 @@ struct ParallelVacuumState
*/
bool *will_parallel_vacuum;
+ /* How many times index vacuuming or cleanup has been executed */
+ int num_index_scans;
+
/*
* The number of indexes that support parallel index bulk-deletion and
* parallel index cleanup respectively.
@@ -223,8 +226,7 @@ struct ParallelVacuumState
static int parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested,
bool *will_parallel_vacuum);
-static void parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scans,
- bool vacuum);
+static void parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, bool vacuum);
static void parallel_vacuum_process_safe_indexes(ParallelVacuumState *pvs);
static void parallel_vacuum_process_unsafe_indexes(ParallelVacuumState *pvs);
static void parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
@@ -497,8 +499,7 @@ parallel_vacuum_reset_dead_items(ParallelVacuumState *pvs)
* Do parallel index bulk-deletion with parallel workers.
*/
void
-parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs, long num_table_tuples,
- int num_index_scans)
+parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs, long num_table_tuples)
{
Assert(!IsParallelWorker());
@@ -509,7 +510,7 @@ parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs, long num_table_tup
pvs->shared->reltuples = num_table_tuples;
pvs->shared->estimated_count = true;
- parallel_vacuum_process_all_indexes(pvs, num_index_scans, true);
+ parallel_vacuum_process_all_indexes(pvs, true);
}
/*
@@ -517,7 +518,7 @@ parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs, long num_table_tup
*/
void
parallel_vacuum_cleanup_all_indexes(ParallelVacuumState *pvs, long num_table_tuples,
- int num_index_scans, bool estimated_count)
+ bool estimated_count)
{
Assert(!IsParallelWorker());
@@ -529,7 +530,7 @@ parallel_vacuum_cleanup_all_indexes(ParallelVacuumState *pvs, long num_table_tup
pvs->shared->reltuples = num_table_tuples;
pvs->shared->estimated_count = estimated_count;
- parallel_vacuum_process_all_indexes(pvs, num_index_scans, false);
+ parallel_vacuum_process_all_indexes(pvs, false);
}
/*
@@ -608,8 +609,7 @@ parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested,
* must be used by the parallel vacuum leader process.
*/
static void
-parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scans,
- bool vacuum)
+parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, bool vacuum)
{
int nworkers;
PVIndVacStatus new_status;
@@ -631,7 +631,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
nworkers = pvs->nindexes_parallel_cleanup;
/* Add conditionally parallel-aware indexes if in the first time call */
- if (num_index_scans == 0)
+ if (pvs->num_index_scans == 0)
nworkers += pvs->nindexes_parallel_condcleanup;
}
@@ -659,7 +659,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
indstats->parallel_workers_can_process =
(pvs->will_parallel_vacuum[i] &&
parallel_vacuum_index_is_parallel_safe(pvs->indrels[i],
- num_index_scans,
+ pvs->num_index_scans,
vacuum));
}
@@ -670,7 +670,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
if (nworkers > 0)
{
/* Reinitialize parallel context to relaunch parallel workers */
- if (num_index_scans > 0)
+ if (pvs->num_index_scans > 0)
ReinitializeParallelDSM(pvs->pcxt);
/*
@@ -764,6 +764,9 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
VacuumSharedCostBalance = NULL;
VacuumActiveNWorkers = NULL;
}
+
+ /* Increment the counter */
+ pvs->num_index_scans++;
}
/*
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 759f9a87d38..7613d00e26f 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -366,11 +366,9 @@ extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs,
VacDeadItemsInfo **dead_items_info_p);
extern void parallel_vacuum_reset_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
- long num_table_tuples,
- int num_index_scans);
+ long num_table_tuples);
extern void parallel_vacuum_cleanup_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
- int num_index_scans,
bool estimated_count);
extern void parallel_vacuum_main(dsm_segment *seg, shm_toc *toc);
--
2.43.5
Attachment: v5-0003-Support-parallel-heap-scan-during-lazy-vacuum.patch (application/octet-stream)
From 714952711dab3bb3940ed9792caafdfe7e4f7672 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 1 Jul 2024 15:17:46 +0900
Subject: [PATCH v5 3/8] Support parallel heap scan during lazy vacuum.
Commit 40d964ec99 allowed the VACUUM command to process indexes in
parallel. This change extends the parallel vacuum to support parallel
heap scan during lazy vacuum.
---
doc/src/sgml/ref/vacuum.sgml | 58 +-
src/backend/access/heap/heapam_handler.c | 6 +
src/backend/access/heap/vacuumlazy.c | 929 ++++++++++++++++++++---
src/backend/commands/vacuumparallel.c | 305 ++++++--
src/backend/storage/ipc/procarray.c | 74 --
src/include/access/heapam.h | 8 +
src/include/access/tableam.h | 88 +++
src/include/commands/vacuum.h | 8 +-
src/include/utils/snapmgr.h | 2 +-
src/include/utils/snapmgr_internal.h | 89 +++
src/tools/pgindent/typedefs.list | 3 +
11 files changed, 1330 insertions(+), 240 deletions(-)
create mode 100644 src/include/utils/snapmgr_internal.h
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 9110938fab6..aae0bbcd577 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -277,27 +277,43 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
<varlistentry>
<term><literal>PARALLEL</literal></term>
<listitem>
- <para>
- Perform index vacuum and index cleanup phases of <command>VACUUM</command>
- in parallel using <replaceable class="parameter">integer</replaceable>
- background workers (for the details of each vacuum phase, please
- refer to <xref linkend="vacuum-phases"/>). The number of workers used
- to perform the operation is equal to the number of indexes on the
- relation that support parallel vacuum which is limited by the number of
- workers specified with <literal>PARALLEL</literal> option if any which is
- further limited by <xref linkend="guc-max-parallel-maintenance-workers"/>.
- An index can participate in parallel vacuum if and only if the size of the
- index is more than <xref linkend="guc-min-parallel-index-scan-size"/>.
- Please note that it is not guaranteed that the number of parallel workers
- specified in <replaceable class="parameter">integer</replaceable> will be
- used during execution. It is possible for a vacuum to run with fewer
- workers than specified, or even with no workers at all. Only one worker
- can be used per index. So parallel workers are launched only when there
- are at least <literal>2</literal> indexes in the table. Workers for
- vacuum are launched before the start of each phase and exit at the end of
- the phase. These behaviors might change in a future release. This
- option can't be used with the <literal>FULL</literal> option.
- </para>
+ <para>
+ Perform the heap scan, index vacuum, and index cleanup phases of
+ <command>VACUUM</command> in parallel using
+ <replaceable class="parameter">integer</replaceable> background workers
+ (for the details of each vacuum phase, please refer to
+ <xref linkend="vacuum-phases"/>).
+ </para>
+ <para>
+ For heap tables, the number of workers used to perform the heap scan
+ is determined based on the size of the table. A table can participate in
+ a parallel heap scan if and only if the size of the table is more than
+ <xref linkend="guc-min-parallel-table-scan-size"/>. During the heap scan,
+ the heap table's blocks will be divided into ranges and shared among the
+ cooperating processes. Each worker process will complete the scanning of
+ its given range of blocks before requesting an additional range of blocks.
+ </para>
+ <para>
+ The number of workers used to perform parallel index vacuum and index
+ cleanup is equal to the number of indexes on the relation that support
+ parallel vacuum. An index can participate in parallel vacuum if and only
+ if the size of the index is more than <xref linkend="guc-min-parallel-index-scan-size"/>.
+ Only one worker can be used per index. So parallel workers for index vacuum
+ and index cleanup are launched only when there are at least <literal>2</literal>
+ indexes in the table.
+ </para>
+ <para>
+ Workers for vacuum are launched before the start of each phase and exit
+ at the end of the phase. The number of workers for each phase is limited by
+ the number of workers specified with the <literal>PARALLEL</literal> option,
+ if any, which is further limited by <xref linkend="guc-max-parallel-maintenance-workers"/>.
+ Please note that in any parallel vacuum phase, it is not guaranteed that the
+ number of parallel workers specified in <replaceable class="parameter">integer</replaceable>
+ will be used during execution. It is possible for a vacuum to run with fewer
+ workers than specified, or even with no workers at all. These behaviors might
+ change in a future release. This option can't be used with the <literal>FULL</literal>
+ option.
+ </para>
</listitem>
</varlistentry>
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 2da4e4da13e..598fafae4a0 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2662,6 +2662,12 @@ static const TableAmRoutine heapam_methods = {
.relation_copy_data = heapam_relation_copy_data,
.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
.relation_vacuum = heap_vacuum_rel,
+
+ .parallel_vacuum_compute_workers = heap_parallel_vacuum_compute_workers,
+ .parallel_vacuum_estimate = heap_parallel_vacuum_estimate,
+ .parallel_vacuum_initialize = heap_parallel_vacuum_initialize,
+ .parallel_vacuum_relation_worker = heap_parallel_vacuum_worker,
+
.scan_analyze_next_block = heapam_scan_analyze_next_block,
.scan_analyze_next_tuple = heapam_scan_analyze_next_tuple,
.index_build_range_scan = heapam_index_build_range_scan,
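/*
 * Editorial sketch (not part of the patch): the shapes of the new table AM
 * callbacks, taken from the heap implementations added below. A table AM
 * that wants parallel vacuum support would provide functions of these forms:
 *
 *   int  heap_parallel_vacuum_compute_workers(Relation rel, int nrequested);
 *   void heap_parallel_vacuum_estimate(Relation rel, ParallelContext *pcxt,
 *                                      int nworkers, void *state);
 *   void heap_parallel_vacuum_initialize(Relation rel, ParallelContext *pcxt,
 *                                        int nworkers, void *state);
 *   void heap_parallel_vacuum_worker(Relation rel, ParallelVacuumState *pvs,
 *                                    ParallelWorkerContext *pwcxt);
 */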
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 61b77af09b1..2e70bc68d2c 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -48,6 +48,7 @@
#include "common/int.h"
#include "executor/instrument.h"
#include "miscadmin.h"
+#include "optimizer/paths.h"
#include "pgstat.h"
#include "portability/instr_time.h"
#include "postmaster/autovacuum.h"
@@ -115,10 +116,24 @@
#define PREFETCH_SIZE ((BlockNumber) 32)
/*
- * Macro to check if we are in a parallel vacuum. If true, we are in the
- * parallel mode and the DSM segment is initialized.
+ * DSM keys for parallel heap vacuum scan. Unlike other parallel execution
+ * code, we don't need to worry about DSM keys conflicting with plan_node_id,
+ * but we do need to avoid conflicting with the DSM keys used in vacuumparallel.c.
+ */
+#define LV_PARALLEL_KEY_SCAN_SHARED 0xFFFF0001
+#define LV_PARALLEL_KEY_SCAN_DESC 0xFFFF0002
+#define LV_PARALLEL_KEY_SCAN_DESC_WORKER 0xFFFF0003
+
+/*
+ * Macros to check if we are in parallel heap vacuuming, parallel index vacuuming,
+ * or both. If ParallelVacuumIsActive() is true, we are in the parallel mode, meaning
+ * that the dead item TIDs are stored in a shared memory area.
*/
#define ParallelVacuumIsActive(vacrel) ((vacrel)->pvs != NULL)
+#define ParallelIndexVacuumIsActive(vacrel) \
+ (ParallelVacuumIsActive(vacrel) && parallel_vacuum_get_nworkers_index((vacrel)->pvs) > 0)
+#define ParallelHeapVacuumIsActive(vacrel) \
+ (ParallelVacuumIsActive(vacrel) && parallel_vacuum_get_nworkers_table((vacrel)->pvs) > 0)
/* Phases of vacuum during which we report error context. */
typedef enum
@@ -172,6 +187,87 @@ typedef struct LVRelScanState
bool skippedallvis;
} LVRelScanState;
+/*
+ * Struct for information that needs to be shared among parallel vacuum workers
+ */
+typedef struct PHVShared
+{
+ bool aggressive;
+ bool skipwithvm;
+
+ /* The current oldest extant XID/MXID shared by the leader process */
+ TransactionId NewRelfrozenXid;
+ MultiXactId NewRelminMxid;
+
+ /*
+ * Have we skipped any all-visible pages?
+ *
+ * The final value is OR of worker's skippedallvis.
+ */
+ bool skippedallvis;
+
+ /* VACUUM operation's cutoffs for freezing and pruning */
+ struct VacuumCutoffs cutoffs;
+ GlobalVisState vistest;
+
+ /* per-worker scan stats for parallel heap vacuum scan */
+ LVRelScanState worker_scan_state[FLEXIBLE_ARRAY_MEMBER];
+} PHVShared;
+#define SizeOfPHVShared (offsetof(PHVShared, worker_scan_state))
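/*
 * Editorial note: given the flexible array member above, the shared area is
 * sized as in heap_parallel_estimate_shared_memory_size() below, roughly:
 *
 *   Size shared_len = SizeOfPHVShared + sizeof(LVRelScanState) * nworkers;
 *
 * and each worker addresses its own stats slot with
 * &shared->worker_scan_state[ParallelWorkerNumber].
 */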
+
+/* Per-worker scan state for parallel heap vacuum scan */
+typedef struct PHVScanWorkerState
+{
+ bool initialized;
+
+ /* per-worker parallel table scan state */
+ ParallelBlockTableScanWorkerData state;
+
+ /*
+ * True if a parallel vacuum scan worker allocated blocks in state but
+ * might not have scanned all of them. The leader process will take over
+ * scanning the remaining blocks.
+ */
+ bool maybe_have_blocks;
+
+ /* last block number the worker scanned */
+ BlockNumber last_blkno;
+} PHVScanWorkerState;
+
+/* Struct for parallel heap vacuum */
+typedef struct PHVState
+{
+ /* Parallel scan description shared among parallel workers */
+ ParallelBlockTableScanDesc pscandesc;
+
+ /* Shared information */
+ PHVShared *shared;
+
+ /*
+ * Points to the array of per-worker scan states stored in the DSM area.
+ *
+ * During parallel heap scan, each worker allocates some chunks of blocks
+ * to scan in its scan state, and could exit while leaving some chunks
+ * un-scanned if the size of dead_items TIDs is close to overrunning the
+ * available space. We store the scan states in the shared memory area so
+ * that workers can resume the heap scan from the previous point.
+ */
+ PHVScanWorkerState *scanstates;
+
+ /* Assigned per-worker scan state */
+ PHVScanWorkerState *myscanstate;
+
+ /*
+ * All blocks up to this value have been scanned, i.e. the minimum of all
+ * PHVScanWorkerState->last_blkno. This field is updated by
+ * parallel_heap_vacuum_compute_min_scanned_blkno().
+ */
+ BlockNumber min_scanned_blkno;
+
+ /* The number of workers launched for parallel heap vacuum */
+ int nworkers_launched;
+} PHVState;
+
typedef struct LVRelState
{
/* Target heap relation and its indexes */
@@ -183,6 +279,9 @@ typedef struct LVRelState
BufferAccessStrategy bstrategy;
ParallelVacuumState *pvs;
+ /* Parallel heap vacuum state */
+ PHVState *phvstate;
+
/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
bool aggressive;
/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
@@ -223,6 +322,8 @@ typedef struct LVRelState
VacDeadItemsInfo *dead_items_info;
BlockNumber rel_pages; /* total number of pages */
+ BlockNumber next_fsm_block_to_vacuum; /* next block to check for FSM
+ * vacuum */
/* Working state for heap scanning and vacuuming */
LVRelScanState *scan_state;
@@ -254,8 +355,11 @@ typedef struct LVSavedErrInfo
/* non-export function prototypes */
static void lazy_scan_heap(LVRelState *vacrel);
+static bool do_lazy_scan_heap(LVRelState *vacrel);
static bool heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
bool *all_visible_according_to_vm);
+static bool heap_vac_scan_next_block_parallel(LVRelState *vacrel, BlockNumber *blkno,
+ bool *all_visible_according_to_vm);
static void find_next_unskippable_block(LVRelState *vacrel, bool *skipsallvis);
static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
BlockNumber blkno, Page page,
@@ -296,6 +400,11 @@ static void dead_items_cleanup(LVRelState *vacrel);
static bool heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
TransactionId *visibility_cutoff_xid, bool *all_frozen);
static void update_relstats_all_indexes(LVRelState *vacrel);
+static void do_parallel_lazy_scan_heap(LVRelState *vacrel);
+static void parallel_heap_vacuum_compute_min_scanned_blkno(LVRelState *vacrel);
+static void parallel_heap_vacuum_gather_scan_results(LVRelState *vacrel);
+static void parallel_heap_complete_unfinished_scan(LVRelState *vacrel);
+
static void vacuum_error_callback(void *arg);
static void update_vacuum_error_info(LVRelState *vacrel,
LVSavedErrInfo *saved_vacrel,
@@ -432,6 +541,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
Assert(params->index_cleanup == VACOPTVALUE_AUTO);
}
+ vacrel->next_fsm_block_to_vacuum = 0;
+
/* Initialize page counters explicitly (be tidy) */
scan_state = palloc(sizeof(LVRelScanState));
scan_state->scanned_pages = 0;
@@ -452,6 +563,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->scan_state = scan_state;
/* dead_items_alloc allocates vacrel->dead_items later on */
/* Allocate/initialize output statistics state */
vacrel->new_rel_tuples = 0;
vacrel->new_live_tuples = 0;
@@ -861,12 +974,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
static void
lazy_scan_heap(LVRelState *vacrel)
{
- BlockNumber rel_pages = vacrel->rel_pages,
- blkno,
- next_fsm_block_to_vacuum = 0;
- bool all_visible_according_to_vm;
-
- Buffer vmbuffer = InvalidBuffer;
+ BlockNumber rel_pages = vacrel->rel_pages;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -886,12 +994,93 @@ lazy_scan_heap(LVRelState *vacrel)
vacrel->next_unskippable_allvis = false;
vacrel->next_unskippable_vmbuffer = InvalidBuffer;
- while (heap_vac_scan_next_block(vacrel, &blkno, &all_visible_according_to_vm))
+ /*
+ * Do the actual work. If parallel heap vacuum is active, we scan and
+ * vacuum the heap using parallel workers.
+ */
+ if (ParallelHeapVacuumIsActive(vacrel))
+ do_parallel_lazy_scan_heap(vacrel);
+ else
+ {
+ bool scan_done PG_USED_FOR_ASSERTS_ONLY;
+
+ scan_done = do_lazy_scan_heap(vacrel);
+
+ /* We must have scanned all heap pages */
+ Assert(scan_done);
+ }
+
+ /* report that everything is now scanned */
+ pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, rel_pages);
+
+ /* now we can compute the new value for pg_class.reltuples */
+ vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
+ vacrel->scan_state->scanned_pages,
+ vacrel->scan_state->live_tuples);
+
+ /*
+ * Also compute the total number of surviving heap entries. In the
+ * (unlikely) scenario that new_live_tuples is -1, take it as zero.
+ */
+ vacrel->new_rel_tuples =
+ Max(vacrel->new_live_tuples, 0) + vacrel->scan_state->recently_dead_tuples +
+ vacrel->scan_state->missed_dead_tuples;
+
+ /*
+ * Do index vacuuming (call each index's ambulkdelete routine), then do
+ * related heap vacuuming
+ */
+ if (vacrel->dead_items_info->num_items > 0)
+ lazy_vacuum(vacrel);
+
+ /*
+ * Vacuum the remainder of the Free Space Map. We must do this whether or
+ * not there were indexes, and whether or not we bypassed index vacuuming.
+ */
+ if (rel_pages > vacrel->next_fsm_block_to_vacuum)
+ FreeSpaceMapVacuumRange(vacrel->rel, vacrel->next_fsm_block_to_vacuum,
+ rel_pages);
+
+ /* report all blocks vacuumed */
+ pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, rel_pages);
+
+ /* Do final index cleanup (call each index's amvacuumcleanup routine) */
+ if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
+ lazy_cleanup_all_indexes(vacrel);
+}
+
+/*
+ * Workhorse for lazy_scan_heap().
+ *
+ * Returns true if we processed all blocks; returns false if we exited before
+ * completing the heap scan because the space for dead item TIDs was nearly full.
+ * In the serial heap scan case, this function always returns true. In the
+ * parallel heap scan case, this function is called by both worker processes and
+ * the leader process, and could return false.
+ */
+static bool
+do_lazy_scan_heap(LVRelState *vacrel)
+{
+ bool all_visible_according_to_vm;
+ BlockNumber blkno;
+ Buffer vmbuffer = InvalidBuffer;
+ bool scan_done = true;
+
+ while (true)
{
Buffer buf;
Page page;
bool has_lpdead_items;
bool got_cleanup_lock = false;
+ bool got_blkno;
+
+ /* Get the next block for vacuum to process */
+ if (ParallelHeapVacuumIsActive(vacrel))
+ got_blkno = heap_vac_scan_next_block_parallel(vacrel, &blkno, &all_visible_according_to_vm);
+ else
+ got_blkno = heap_vac_scan_next_block(vacrel, &blkno, &all_visible_according_to_vm);
+
+ if (!got_blkno)
+ break;
vacrel->scan_state->scanned_pages++;
@@ -911,46 +1100,10 @@ lazy_scan_heap(LVRelState *vacrel)
* one-pass strategy, and the two-pass strategy with the index_cleanup
* param set to 'off'.
*/
- if (vacrel->scan_state->scanned_pages % FAILSAFE_EVERY_PAGES == 0)
+ if (!IsParallelWorker() &&
+ vacrel->scan_state->scanned_pages % FAILSAFE_EVERY_PAGES == 0)
lazy_check_wraparound_failsafe(vacrel);
- /*
- * Consider if we definitely have enough space to process TIDs on page
- * already. If we are close to overrunning the available space for
- * dead_items TIDs, pause and do a cycle of vacuuming before we tackle
- * this page.
- */
- if (TidStoreMemoryUsage(vacrel->dead_items) > vacrel->dead_items_info->max_bytes)
- {
- /*
- * Before beginning index vacuuming, we release any pin we may
- * hold on the visibility map page. This isn't necessary for
- * correctness, but we do it anyway to avoid holding the pin
- * across a lengthy, unrelated operation.
- */
- if (BufferIsValid(vmbuffer))
- {
- ReleaseBuffer(vmbuffer);
- vmbuffer = InvalidBuffer;
- }
-
- /* Perform a round of index and heap vacuuming */
- vacrel->consider_bypass_optimization = false;
- lazy_vacuum(vacrel);
-
- /*
- * Vacuum the Free Space Map to make newly-freed space visible on
- * upper-level FSM pages. Note we have not yet processed blkno.
- */
- FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
- blkno);
- next_fsm_block_to_vacuum = blkno;
-
- /* Report that we are once again scanning the heap */
- pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
- PROGRESS_VACUUM_PHASE_SCAN_HEAP);
- }
-
/*
* Pin the visibility map page in case we need to mark the page
* all-visible. In most cases this will be very cheap, because we'll
@@ -1039,9 +1192,10 @@ lazy_scan_heap(LVRelState *vacrel)
* revisit this page. Since updating the FSM is desirable but not
* absolutely required, that's OK.
*/
- if (vacrel->nindexes == 0
- || !vacrel->do_index_vacuuming
- || !has_lpdead_items)
+ if (!IsParallelWorker() &&
+ (vacrel->nindexes == 0
+ || !vacrel->do_index_vacuuming
+ || !has_lpdead_items))
{
Size freespace = PageGetHeapFreeSpace(page);
@@ -1055,57 +1209,178 @@ lazy_scan_heap(LVRelState *vacrel)
* held the cleanup lock and lazy_scan_prune() was called.
*/
if (got_cleanup_lock && vacrel->nindexes == 0 && has_lpdead_items &&
- blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
+ blkno - vacrel->next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
{
- FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
- blkno);
- next_fsm_block_to_vacuum = blkno;
+ BlockNumber fsm_vac_up_to;
+
+ /*
+ * If parallel heap vacuum scan is active, compute the minimum
+ * block number we scanned so far.
+ */
+ if (ParallelHeapVacuumIsActive(vacrel))
+ {
+ parallel_heap_vacuum_compute_min_scanned_blkno(vacrel);
+ fsm_vac_up_to = vacrel->phvstate->min_scanned_blkno;
+ }
+ else
+ {
+ /* blkno is already processed */
+ fsm_vac_up_to = blkno + 1;
+ }
+
+ FreeSpaceMapVacuumRange(vacrel->rel, vacrel->next_fsm_block_to_vacuum,
+ fsm_vac_up_to);
+ vacrel->next_fsm_block_to_vacuum = fsm_vac_up_to;
}
}
else
UnlockReleaseBuffer(buf);
+
+ /*
+ * Consider if we definitely have enough space to process TIDs on page
+ * already. If we are close to overrunning the available space for
+ * dead_items TIDs, pause and do a cycle of vacuuming before we tackle
+ * this page.
+ */
+ if (TidStoreMemoryUsage(vacrel->dead_items) > vacrel->dead_items_info->max_bytes)
+ {
+ /*
+ * Before beginning index vacuuming, we release any pin we may
+ * hold on the visibility map page. This isn't necessary for
+ * correctness, but we do it anyway to avoid holding the pin
+ * across a lengthy, unrelated operation.
+ */
+ if (BufferIsValid(vmbuffer))
+ {
+ ReleaseBuffer(vmbuffer);
+ vmbuffer = InvalidBuffer;
+ }
+
+ /*
+ * In parallel heap scan, we pause the heap scan without invoking
+ * index and heap vacuuming, and return to the caller with
+ * scan_done being false. The parallel vacuum workers will exit as
+ * their jobs are done. The leader process will wait for all
+ * workers to finish and perform index and heap vacuuming, and
+ * then performs FSM vacuuming as well.
+ */
+ if (ParallelHeapVacuumIsActive(vacrel))
+ {
+ /* Remember the last scanned block */
+ vacrel->phvstate->myscanstate->last_blkno = blkno;
+
+ /* Remember we might have some unprocessed blocks */
+ scan_done = false;
+
+ break;
+ }
+
+ /* Perform a round of index and heap vacuuming */
+ vacrel->consider_bypass_optimization = false;
+ lazy_vacuum(vacrel);
+
+ /*
+ * Vacuum the Free Space Map to make newly-freed space visible on
+ * upper-level FSM pages.
+ */
+ FreeSpaceMapVacuumRange(vacrel->rel, vacrel->next_fsm_block_to_vacuum,
+ blkno + 1);
+ vacrel->next_fsm_block_to_vacuum = blkno + 1;
+
+ /* Report that we are once again scanning the heap */
+ pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+ PROGRESS_VACUUM_PHASE_SCAN_HEAP);
+
+ continue;
+ }
}
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
ReleaseBuffer(vmbuffer);
- /* report that everything is now scanned */
- pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
+ return scan_done;
+}
- /* now we can compute the new value for pg_class.reltuples */
- vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
- vacrel->scan_state->scanned_pages,
- vacrel->scan_state->live_tuples);
+/*
+ * A parallel scan variant of heap_vac_scan_next_block(). Similar to
+ * heap_vac_scan_next_block(), the block number and visibility status of the next
+ * block to process are set in *blkno and *all_visible_according_to_vm. The return
+ * value is false if there are no further blocks to process.
+ *
+ * In parallel vacuum scan, we don't use the SKIP_PAGES_THRESHOLD optimization.
+ */
+static bool
+heap_vac_scan_next_block_parallel(LVRelState *vacrel, BlockNumber *blkno,
+ bool *all_visible_according_to_vm)
+{
+ PHVState *phvstate = vacrel->phvstate;
+ BlockNumber next_block;
+ Buffer vmbuffer = InvalidBuffer;
+ uint8 mapbits = 0;
- /*
- * Also compute the total number of surviving heap entries. In the
- * (unlikely) scenario that new_live_tuples is -1, take it as zero.
- */
- vacrel->new_rel_tuples =
- Max(vacrel->new_live_tuples, 0) + vacrel->scan_state->recently_dead_tuples +
- vacrel->scan_state->missed_dead_tuples;
+ Assert(ParallelHeapVacuumIsActive(vacrel));
- /*
- * Do index vacuuming (call each index's ambulkdelete routine), then do
- * related heap vacuuming
- */
- if (vacrel->dead_items_info->num_items > 0)
- lazy_vacuum(vacrel);
+ for (;;)
+ {
+ next_block = table_block_parallelscan_nextpage(vacrel->rel,
+ &(phvstate->myscanstate->state),
+ phvstate->pscandesc);
- /*
- * Vacuum the remainder of the Free Space Map. We must do this whether or
- * not there were indexes, and whether or not we bypassed index vacuuming.
- */
- if (blkno > next_fsm_block_to_vacuum)
- FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno);
+ /* Have we reached the end of the table? */
+ if (!BlockNumberIsValid(next_block) || next_block >= vacrel->rel_pages)
+ {
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
- /* report all blocks vacuumed */
- pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
+ *blkno = vacrel->rel_pages;
+ return false;
+ }
- /* Do final index cleanup (call each index's amvacuumcleanup routine) */
- if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
- lazy_cleanup_all_indexes(vacrel);
+ /* We always treat the last block as unsafe to skip */
+ if (next_block == vacrel->rel_pages - 1)
+ break;
+
+ mapbits = visibilitymap_get_status(vacrel->rel, next_block, &vmbuffer);
+
+ /*
+ * A block is unskippable if it is not all visible according to the
+ * visibility map.
+ */
+ if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
+ {
+ Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
+ break;
+ }
+
+ /* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
+ if (!vacrel->skipwithvm)
+ break;
+
+ /*
+ * Aggressive VACUUM caller can't skip pages just because they are
+ * all-visible.
+ */
+ if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
+ {
+ if (vacrel->aggressive)
+ break;
+
+ /*
+ * An all-visible block is safe to skip in the non-aggressive case,
+ * but remember for later that we skipped such a block.
+ */
+ vacrel->scan_state->skippedallvis = true;
+ }
+ }
+
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
+
+ *blkno = next_block;
+ *all_visible_according_to_vm = (mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0;
+
+ return true;
}
/*
@@ -1254,11 +1529,12 @@ find_next_unskippable_block(LVRelState *vacrel, bool *skipsallvis)
/*
* Caller must scan the last page to determine whether it has tuples
- * (caller must have the opportunity to set vacrel->nonempty_pages).
- * This rule avoids having lazy_truncate_heap() take access-exclusive
- * lock on rel to attempt a truncation that fails anyway, just because
- * there are tuples on the last page (it is likely that there will be
- * tuples on other nearby pages as well, but those can be skipped).
+ * (caller must have the opportunity to set
+ * vacrel->scan_state->nonempty_pages). This rule avoids having
+ * lazy_truncate_heap() take access-exclusive lock on rel to attempt a
+ * truncation that fails anyway, just because there are tuples on the
+ * last page (it is likely that there will be tuples on other nearby
+ * pages as well, but those can be skipped).
*
* Implement this by always treating the last block as unsafe to skip.
*/
@@ -2117,7 +2393,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
progress_start_val[1] = vacrel->nindexes;
pgstat_progress_update_multi_param(2, progress_start_index, progress_start_val);
- if (!ParallelVacuumIsActive(vacrel))
+ if (!ParallelIndexVacuumIsActive(vacrel))
{
for (int idx = 0; idx < vacrel->nindexes; idx++)
{
@@ -2493,7 +2769,7 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
progress_start_val[1] = vacrel->nindexes;
pgstat_progress_update_multi_param(2, progress_start_index, progress_start_val);
- if (!ParallelVacuumIsActive(vacrel))
+ if (!ParallelIndexVacuumIsActive(vacrel))
{
for (int idx = 0; idx < vacrel->nindexes; idx++)
{
@@ -2943,12 +3219,8 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
autovacuum_work_mem != -1 ?
autovacuum_work_mem : maintenance_work_mem;
- /*
- * Initialize state for a parallel vacuum. As of now, only one worker can
- * be used for an index, so we invoke parallelism only if there are at
- * least two indexes on a table.
- */
- if (nworkers >= 0 && vacrel->nindexes > 1 && vacrel->do_index_vacuuming)
+ /* Initialize state for a parallel vacuum */
+ if (nworkers >= 0)
{
/*
* Since parallel workers cannot access data in temporary tables, we
@@ -2966,11 +3238,20 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
vacrel->relname)));
}
else
+ {
+ /*
+ * We initialize parallel heap scanning/vacuuming, index
+ * vacuuming, or both, based on the table size and the number of
+ * indexes. Since only one worker can be used per index, we invoke
+ * parallelism for index vacuuming only if there are at least two
+ * indexes on the table.
+ */
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
vac_work_mem,
vacrel->verbose ? INFO : DEBUG2,
- vacrel->bstrategy);
+ vacrel->bstrategy, (void *) vacrel);
+ }
/*
* If parallel mode started, dead_items and dead_items_info spaces are
@@ -3010,9 +3291,19 @@ dead_items_add(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
};
int64 prog_val[2];
+ /*
+ * Protect both dead_items and dead_items_info from concurrent updates in
+ * parallel heap scan cases.
+ */
+ if (ParallelHeapVacuumIsActive(vacrel))
+ TidStoreLockExclusive(vacrel->dead_items);
+
TidStoreSetBlockOffsets(vacrel->dead_items, blkno, offsets, num_offsets);
vacrel->dead_items_info->num_items += num_offsets;
+ if (ParallelHeapVacuumIsActive(vacrel))
+ TidStoreUnlock(vacrel->dead_items);
+
/* update the progress information */
prog_val[0] = vacrel->dead_items_info->num_items;
prog_val[1] = TidStoreMemoryUsage(vacrel->dead_items);
@@ -3212,6 +3503,448 @@ update_relstats_all_indexes(LVRelState *vacrel)
}
}
+/*
+ * Compute the number of parallel workers for parallel vacuum heap scan.
+ *
+ * The calculation logic is borrowed from compute_parallel_worker().
+ */
+int
+heap_parallel_vacuum_compute_workers(Relation rel, int nrequested)
+{
+ int parallel_workers = 0;
+ int heap_parallel_threshold;
+ int heap_pages;
+
+ if (nrequested == 0)
+ {
+ /*
+ * Select the number of workers based on the log of the size of the
+ * relation. Note that the upper limit of the
+ * min_parallel_table_scan_size GUC is chosen to prevent overflow
+ * here.
+ */
+ heap_parallel_threshold = Max(min_parallel_table_scan_size, 1);
+ heap_pages = RelationGetNumberOfBlocks(rel);
+ while (heap_pages >= (BlockNumber) (heap_parallel_threshold * 3))
+ {
+ parallel_workers++;
+ heap_parallel_threshold *= 3;
+ if (heap_parallel_threshold > INT_MAX / 3)
+ break;
+ }
+ }
+ else
+ parallel_workers = nrequested;
+
+ return parallel_workers;
+}
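/*
 * Editorial worked example (assuming the default min_parallel_table_scan_size
 * of 8MB, i.e. 1024 8kB blocks): the loop above grants 1 worker at
 * >= 3072 blocks (24MB), 2 at >= 9216 blocks (72MB), 3 at >= 27648 blocks
 * (216MB), and so on, tripling the threshold each step, so the worker count
 * grows logarithmically with table size.
 */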
+
+/* Estimate shared memory sizes required for parallel heap vacuum */
+static inline void
+heap_parallel_estimate_shared_memory_size(Relation rel, int nworkers, Size *pscan_len,
+ Size *shared_len, Size *pscanwork_len)
+{
+ Size size = 0;
+
+ size = add_size(size, SizeOfPHVShared);
+ size = add_size(size, mul_size(sizeof(LVRelScanState), nworkers));
+ *shared_len = size;
+
+ *pscan_len = table_block_parallelscan_estimate(rel);
+
+ *pscanwork_len = mul_size(sizeof(PHVScanWorkerState), nworkers);
+}
+
+/*
+ * Compute the amount of space we'll need in the parallel heap vacuum
+ * DSM, and inform pcxt->estimator about our needs.
+ *
+ * nworkers is the number of workers for the table vacuum. Note that it could
+ * be different from pcxt->nworkers, since the latter is the maximum of the
+ * number of workers for table vacuum and index vacuum.
+ */
+void
+heap_parallel_vacuum_estimate(Relation rel, ParallelContext *pcxt,
+ int nworkers, void *state)
+{
+ Size pscan_len;
+ Size shared_len;
+ Size pscanwork_len;
+
+ heap_parallel_estimate_shared_memory_size(rel, nworkers, &pscan_len,
+ &shared_len, &pscanwork_len);
+
+ /* space for PHVShared */
+ shm_toc_estimate_chunk(&pcxt->estimator, shared_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+ /* space for ParallelBlockTableScanDesc */
+ shm_toc_estimate_chunk(&pcxt->estimator, pscan_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+ /* space for per-worker scan state, PHVScanWorkerState */
+ shm_toc_estimate_chunk(&pcxt->estimator, pscanwork_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/*
+ * Set up shared memory for parallel heap vacuum.
+ */
+void
+heap_parallel_vacuum_initialize(Relation rel, ParallelContext *pcxt,
+ int nworkers, void *state)
+{
+ LVRelState *vacrel = (LVRelState *) state;
+ PHVState *phvstate;
+ ParallelBlockTableScanDesc pscan;
+ PHVScanWorkerState *pscanwork;
+ PHVShared *shared;
+ Size pscan_len;
+ Size shared_len;
+ Size pscanwork_len;
+
+ phvstate = (PHVState *) palloc0(sizeof(PHVState));
+ phvstate->min_scanned_blkno = InvalidBlockNumber;
+
+ heap_parallel_estimate_shared_memory_size(rel, nworkers, &pscan_len,
+ &shared_len, &pscanwork_len);
+
+ shared = shm_toc_allocate(pcxt->toc, shared_len);
+
+ /* Prepare the shared information */
+
+ MemSet(shared, 0, shared_len);
+ shared->aggressive = vacrel->aggressive;
+ shared->skipwithvm = vacrel->skipwithvm;
+ shared->cutoffs = vacrel->cutoffs;
+ shared->NewRelfrozenXid = vacrel->scan_state->NewRelfrozenXid;
+ shared->NewRelminMxid = vacrel->scan_state->NewRelminMxid;
+ shared->skippedallvis = vacrel->scan_state->skippedallvis;
+
+ /*
+ * XXX: we copy the contents of vistest to the shared area, but in order
+ * to do that, we need to either expose GlobalVisTest or provide functions
+ * to copy the contents of GlobalVisTest somewhere. Currently we do the
+ * former, but it's not clear that's the best choice.
+ *
+ * An alternative idea is to have each worker determine the cutoff and use
+ * its own vistest. But we need to consider that carefully since parallel
+ * workers would end up having different cutoffs and horizons.
+ */
+ shared->vistest = *vacrel->vistest;
+
+ shm_toc_insert(pcxt->toc, LV_PARALLEL_KEY_SCAN_SHARED, shared);
+
+ phvstate->shared = shared;
+
+ /* prepare the parallel block table scan description */
+ pscan = shm_toc_allocate(pcxt->toc, pscan_len);
+ shm_toc_insert(pcxt->toc, LV_PARALLEL_KEY_SCAN_DESC, pscan);
+
+ /* initialize parallel scan description */
+ table_block_parallelscan_initialize(rel, (ParallelTableScanDesc) pscan);
+
+ /* Disable sync scan to always start from the first block */
+ pscan->base.phs_syncscan = false;
+
+ phvstate->pscandesc = pscan;
+
+ /* prepare the workers' parallel block table scan state */
+ pscanwork = shm_toc_allocate(pcxt->toc, pscanwork_len);
+ MemSet(pscanwork, 0, pscanwork_len);
+ shm_toc_insert(pcxt->toc, LV_PARALLEL_KEY_SCAN_DESC_WORKER, pscanwork);
+ phvstate->scanstates = pscanwork;
+
+ vacrel->phvstate = phvstate;
+}
+
+/*
+ * Main function for parallel heap vacuum workers.
+ */
+void
+heap_parallel_vacuum_worker(Relation rel, ParallelVacuumState *pvs,
+ ParallelWorkerContext *pwcxt)
+{
+ LVRelState vacrel = {0};
+ PHVState *phvstate;
+ PHVShared *shared;
+ ParallelBlockTableScanDesc pscandesc;
+ PHVScanWorkerState *scanstate;
+ LVRelScanState *scan_state;
+ ErrorContextCallback errcallback;
+ bool scan_done;
+
+ phvstate = palloc(sizeof(PHVState));
+
+ pscandesc = (ParallelBlockTableScanDesc) shm_toc_lookup(pwcxt->toc,
+ LV_PARALLEL_KEY_SCAN_DESC,
+ false);
+ phvstate->pscandesc = pscandesc;
+
+ shared = (PHVShared *) shm_toc_lookup(pwcxt->toc, LV_PARALLEL_KEY_SCAN_SHARED,
+ false);
+ phvstate->shared = shared;
+
+ scanstate = (PHVScanWorkerState *) shm_toc_lookup(pwcxt->toc,
+ LV_PARALLEL_KEY_SCAN_DESC_WORKER,
+ false);
+
+ phvstate->myscanstate = &(scanstate[ParallelWorkerNumber]);
+ scan_state = &(shared->worker_scan_state[ParallelWorkerNumber]);
+
+ /* Prepare LVRelState */
+ vacrel.rel = rel;
+ vacrel.indrels = parallel_vacuum_get_table_indexes(pvs, &vacrel.nindexes);
+ vacrel.pvs = pvs;
+ vacrel.phvstate = phvstate;
+ vacrel.aggressive = shared->aggressive;
+ vacrel.skipwithvm = shared->skipwithvm;
+ vacrel.cutoffs = shared->cutoffs;
+ vacrel.vistest = &(shared->vistest);
+ vacrel.dead_items = parallel_vacuum_get_dead_items(pvs,
+ &vacrel.dead_items_info);
+ vacrel.rel_pages = RelationGetNumberOfBlocks(rel);
+ vacrel.scan_state = scan_state;
+
+ /* initialize per-worker relation statistics */
+ MemSet(scan_state, 0, sizeof(LVRelScanState));
+
+ /* Set fields necessary for heap scan */
+ vacrel.scan_state->NewRelfrozenXid = shared->NewRelfrozenXid;
+ vacrel.scan_state->NewRelminMxid = shared->NewRelminMxid;
+ vacrel.scan_state->skippedallvis = shared->skippedallvis;
+
+ /* Initialize the per-worker scan state if not yet done */
+ if (!phvstate->myscanstate->initialized)
+ {
+ table_block_parallelscan_startblock_init(rel,
+ &(phvstate->myscanstate->state),
+ phvstate->pscandesc);
+
+ phvstate->myscanstate->last_blkno = InvalidBlockNumber;
+ phvstate->myscanstate->maybe_have_blocks = false;
+ phvstate->myscanstate->initialized = true;
+ }
+
+ /*
+ * Setup error traceback support for ereport() for parallel table vacuum
+ * workers
+ */
+ vacrel.dbname = get_database_name(MyDatabaseId);
+ vacrel.relnamespace = get_namespace_name(RelationGetNamespace(rel));
+ vacrel.relname = pstrdup(RelationGetRelationName(rel));
+ vacrel.indname = NULL;
+ vacrel.phase = VACUUM_ERRCB_PHASE_SCAN_HEAP;
+ errcallback.callback = vacuum_error_callback;
+ errcallback.arg = &vacrel;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ scan_done = do_lazy_scan_heap(&vacrel);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+
+ /*
+ * If the leader or a worker finishes the heap scan because the space for
+ * dead_items TIDs is close to the limit, it might have some allocated blocks
+ * in its scan state. Since this scan state might not be used in the next heap
+ * scan, we remember that it might have some unconsumed blocks so that the
+ * leader can complete the scan after the heap scan phase finishes.
+ */
+ phvstate->myscanstate->maybe_have_blocks = !scan_done;
+}
+
+/*
+ * Complete parallel heap scans that have remaining blocks in their
+ * chunks.
+ */
+static void
+parallel_heap_complete_unfinished_scan(LVRelState *vacrel)
+{
+ int nworkers;
+
+ Assert(!IsParallelWorker());
+
+ nworkers = parallel_vacuum_get_nworkers_table(vacrel->pvs);
+
+ for (int i = 0; i < nworkers; i++)
+ {
+ PHVScanWorkerState *wstate = &(vacrel->phvstate->scanstates[i]);
+ bool scan_done PG_USED_FOR_ASSERTS_ONLY;
+
+ if (!wstate->maybe_have_blocks)
+ continue;
+
+ /* Attach the worker's scan state and do heap scan */
+ vacrel->phvstate->myscanstate = wstate;
+ scan_done = do_lazy_scan_heap(vacrel);
+
+ Assert(scan_done);
+ }
+
+ /*
+ * We don't need to gather the scan results here because the leader's scan
+ * state got updated directly.
+ */
+}
+
+/*
+ * Compute the minimum block number we have scanned so far and update
+ * vacrel->phvstate->min_scanned_blkno.
+ */
+static void
+parallel_heap_vacuum_compute_min_scanned_blkno(LVRelState *vacrel)
+{
+ PHVState *phvstate = vacrel->phvstate;
+
+ Assert(ParallelHeapVacuumIsActive(vacrel));
+
+ /*
+ * We check all worker scan states here to compute the minimum block
+ * number among all scan states.
+ */
+ for (int i = 0; i < phvstate->nworkers_launched; i++)
+ {
+ PHVScanWorkerState *wstate = &(phvstate->scanstates[i]);
+
+ /* Skip if the worker has not initialized its scan state */
+ if (!wstate->initialized)
+ continue;
+
+ if (!BlockNumberIsValid(phvstate->min_scanned_blkno) ||
+ wstate->last_blkno < phvstate->min_scanned_blkno)
+ phvstate->min_scanned_blkno = wstate->last_blkno;
+ }
+}
+
+/* Accumulate each worker's scan results into the leader's */
+static void
+parallel_heap_vacuum_gather_scan_results(LVRelState *vacrel)
+{
+ PHVState *phvstate = vacrel->phvstate;
+
+ Assert(ParallelHeapVacuumIsActive(vacrel));
+ Assert(!IsParallelWorker());
+
+ /* Gather the workers' scan results */
+ for (int i = 0; i < phvstate->nworkers_launched; i++)
+ {
+ LVRelScanState *ss = &(phvstate->shared->worker_scan_state[i]);
+
+ vacrel->scan_state->scanned_pages += ss->scanned_pages;
+ vacrel->scan_state->removed_pages += ss->removed_pages;
+ vacrel->scan_state->vm_new_frozen_pages += ss->vm_new_frozen_pages;
+ vacrel->scan_state->lpdead_item_pages += ss->lpdead_item_pages;
+ vacrel->scan_state->missed_dead_pages += ss->missed_dead_pages;
+ vacrel->scan_state->tuples_deleted += ss->tuples_deleted;
+ vacrel->scan_state->tuples_frozen += ss->tuples_frozen;
+ vacrel->scan_state->lpdead_items += ss->lpdead_items;
+ vacrel->scan_state->live_tuples += ss->live_tuples;
+ vacrel->scan_state->recently_dead_tuples += ss->recently_dead_tuples;
+ vacrel->scan_state->missed_dead_tuples += ss->missed_dead_tuples;
+
+ if (ss->nonempty_pages > vacrel->scan_state->nonempty_pages)
+ vacrel->scan_state->nonempty_pages = ss->nonempty_pages;
+
+ if (TransactionIdPrecedes(ss->NewRelfrozenXid, vacrel->scan_state->NewRelfrozenXid))
+ vacrel->scan_state->NewRelfrozenXid = ss->NewRelfrozenXid;
+
+ if (MultiXactIdPrecedesOrEquals(ss->NewRelminMxid, vacrel->scan_state->NewRelminMxid))
+ vacrel->scan_state->NewRelminMxid = ss->NewRelminMxid;
+
+ if (!vacrel->scan_state->skippedallvis && ss->skippedallvis)
+ vacrel->scan_state->skippedallvis = true;
+ }
+
+ /* Also, compute the minimum block number we scanned so far */
+ parallel_heap_vacuum_compute_min_scanned_blkno(vacrel);
+}
+
+/*
+ * A parallel variant of do_lazy_scan_heap(). The leader process launches parallel
+ * workers to scan the heap in parallel.
+ */
+static void
+do_parallel_lazy_scan_heap(LVRelState *vacrel)
+{
+ PHVScanWorkerState *scanstate;
+
+ Assert(ParallelHeapVacuumIsActive(vacrel));
+ Assert(!IsParallelWorker());
+
+ /* launch parallel workers */
+ vacrel->phvstate->nworkers_launched = parallel_vacuum_table_scan_begin(vacrel->pvs);
+
+ /* initialize the leader's own scan state so it can join as a worker */
+ scanstate = palloc0(sizeof(PHVScanWorkerState));
+ scanstate->last_blkno = InvalidBlockNumber;
+ table_block_parallelscan_startblock_init(vacrel->rel, &(scanstate->state),
+ vacrel->phvstate->pscandesc);
+ vacrel->phvstate->myscanstate = scanstate;
+
+ for (;;)
+ {
+ bool scan_done;
+
+ /*
+ * Scan the table until either we are close to overrunning the
+ * available space for dead_items TIDs or we reach the end of the
+ * table.
+ */
+ scan_done = do_lazy_scan_heap(vacrel);
+
+ /* wait for parallel workers to finish and gather scan results */
+ parallel_vacuum_table_scan_end(vacrel->pvs);
+ parallel_heap_vacuum_gather_scan_results(vacrel);
+
+ /* We reached the end of the table */
+ if (scan_done)
+ break;
+
+ /*
+ * The parallel heap scan paused in the middle of the table because
+ * the space for dead_items TIDs was nearly full. Perform a round of
+ * index and heap vacuuming, and FSM vacuuming.
+ */
+
+ /* Perform a round of index and heap vacuuming */
+ vacrel->consider_bypass_optimization = false;
+ lazy_vacuum(vacrel);
+
+ /*
+ * Vacuum the Free Space Map to make newly-freed space visible on
+ * upper-level FSM pages.
+ */
+ if (vacrel->phvstate->min_scanned_blkno > vacrel->next_fsm_block_to_vacuum)
+ {
+ /*
+ * min_scanned_blkno was updated when gathering the workers' scan
+ * results.
+ */
+ FreeSpaceMapVacuumRange(vacrel->rel, vacrel->next_fsm_block_to_vacuum,
+ vacrel->phvstate->min_scanned_blkno + 1);
+ vacrel->next_fsm_block_to_vacuum = vacrel->phvstate->min_scanned_blkno;
+ }
+
+ /* Report that we are once again scanning the heap */
+ pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+ PROGRESS_VACUUM_PHASE_SCAN_HEAP);
+
+ /* Re-launch workers to restart parallel heap scan */
+ vacrel->phvstate->nworkers_launched =
+ parallel_vacuum_table_scan_begin(vacrel->pvs);
+ }
+
+ /*
+ * The parallel heap scan finished, but it's possible that some workers
+ * have allocated blocks but not processed them yet. This can happen, for
+ * example, when workers exit because the space for dead_items TIDs is full
+ * and the leader process launches fewer workers in the next cycle.
+ */
+ parallel_heap_complete_unfinished_scan(vacrel);
+}
+
/*
* Error context callback for errors occurring during vacuum. The error
* context messages for index phases should match the messages set in parallel
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 50dd3d7d14d..3001be84ddf 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -6,15 +6,24 @@
* This file contains routines that are intended to support setting up, using,
* and tearing down a ParallelVacuumState.
*
- * In a parallel vacuum, we perform both index bulk deletion and index cleanup
- * with parallel worker processes. Individual indexes are processed by one
- * vacuum process. ParallelVacuumState contains shared information as well as
- * the memory space for storing dead items allocated in the DSA area. We
- * launch parallel worker processes at the start of parallel index
- * bulk-deletion and index cleanup and once all indexes are processed, the
- * parallel worker processes exit. Each time we process indexes in parallel,
- * the parallel context is re-initialized so that the same DSM can be used for
- * multiple passes of index bulk-deletion and index cleanup.
+ * In a parallel vacuum, we perform the table scan, index bulk deletion and
+ * index cleanup, or all of them with parallel worker processes. Different
+ * numbers of workers are launched for table vacuuming and index processing.
+ * ParallelVacuumState contains shared information as well as the memory space
+ * for storing dead items allocated in the DSA area.
+ *
+ * When initializing a parallel table vacuum scan, we invoke table AM routines for
+ * estimating DSM sizes and initializing DSM memory. Parallel table vacuum
+ * workers invoke the table AM routine for vacuuming the table.
+ *
+ * For processing indexes in parallel, individual indexes are processed by one
+ * vacuum process. We launch parallel worker processes at the start of parallel index
+ * bulk-deletion and index cleanup and once all indexes are processed, the parallel
+ * worker processes exit.
+ *
+ * Each time we process the table or indexes in parallel, the parallel context is
+ * re-initialized so that the same DSM can be used for multiple passes of table vacuum
+ * or index bulk-deletion and index cleanup.
*
* Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -28,6 +37,7 @@
#include "access/amapi.h"
#include "access/table.h"
+#include "access/tableam.h"
#include "access/xact.h"
#include "commands/progress.h"
#include "commands/vacuum.h"
@@ -65,6 +75,12 @@ typedef struct PVShared
int elevel;
uint64 queryid;
+ /*
+ * True if the caller wants parallel workers to invoke the vacuum table scan
+ * callback.
+ */
+ bool do_vacuum_table_scan;
+
/*
* Fields for both index vacuum and cleanup.
*
@@ -101,6 +117,13 @@ typedef struct PVShared
*/
pg_atomic_uint32 cost_balance;
+ /*
+ * The number of workers for parallel table scan/vacuuming and index
+ * vacuuming, respectively.
+ */
+ int nworkers_for_table;
+ int nworkers_for_index;
+
/*
* Number of active parallel workers. This is used for computing the
* minimum threshold of the vacuum cost balance before a worker sleeps for
@@ -164,6 +187,9 @@ struct ParallelVacuumState
/* NULL for worker processes */
ParallelContext *pcxt;
+ /* Passed to parallel table scan workers. NULL for leader process */
+ ParallelWorkerContext *pwcxt;
+
/* Parent Heap Relation */
Relation heaprel;
@@ -193,6 +219,9 @@ struct ParallelVacuumState
/* Points to WAL usage area in DSM */
WalUsage *wal_usage;
+ /* How many times has parallel table vacuum scan been called? */
+ int num_table_scans;
+
/*
* False if the index is totally unsuitable target for all parallel
* processing. For example, the index could be <
@@ -224,8 +253,9 @@ struct ParallelVacuumState
PVIndVacStatus status;
};
-static int parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested,
- bool *will_parallel_vacuum);
+static void parallel_vacuum_compute_workers(Relation rel, Relation *indrels, int nindexes,
+ int nrequested, int *nworkers_for_table,
+ int *nworkers_for_index, bool *will_parallel_vacuum);
static void parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, bool vacuum);
static void parallel_vacuum_process_safe_indexes(ParallelVacuumState *pvs);
static void parallel_vacuum_process_unsafe_indexes(ParallelVacuumState *pvs);
@@ -244,7 +274,7 @@ static void parallel_vacuum_error_callback(void *arg);
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
int nrequested_workers, int vac_work_mem,
- int elevel, BufferAccessStrategy bstrategy)
+ int elevel, BufferAccessStrategy bstrategy, void *state)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
@@ -258,6 +288,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
Size est_shared_len;
int nindexes_mwm = 0;
int parallel_workers = 0;
+ int nworkers_for_table;
+ int nworkers_for_index;
int querylen;
/*
@@ -265,15 +297,17 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
* relation
*/
Assert(nrequested_workers >= 0);
- Assert(nindexes > 0);
/*
* Compute the number of parallel vacuum workers to launch
*/
will_parallel_vacuum = (bool *) palloc0(sizeof(bool) * nindexes);
- parallel_workers = parallel_vacuum_compute_workers(indrels, nindexes,
- nrequested_workers,
- will_parallel_vacuum);
+ parallel_vacuum_compute_workers(rel, indrels, nindexes, nrequested_workers,
+ &nworkers_for_table, &nworkers_for_index,
+ will_parallel_vacuum);
+
+ parallel_workers = Max(nworkers_for_table, nworkers_for_index);
+
if (parallel_workers <= 0)
{
/* Can't perform vacuum in parallel -- return NULL */
@@ -329,6 +363,10 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
else
querylen = 0; /* keep compiler quiet */
+ /* Estimate AM-specific space for parallel table vacuum */
+ if (nworkers_for_table > 0)
+ table_parallel_vacuum_estimate(rel, pcxt, nworkers_for_table, state);
+
InitializeParallelDSM(pcxt);
/* Prepare index vacuum stats */
@@ -373,6 +411,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shared->relid = RelationGetRelid(rel);
shared->elevel = elevel;
shared->queryid = pgstat_get_my_query_id();
+ shared->nworkers_for_table = nworkers_for_table;
+ shared->nworkers_for_index = nworkers_for_index;
shared->maintenance_work_mem_worker =
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
@@ -421,6 +461,10 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
PARALLEL_VACUUM_KEY_QUERY_TEXT, sharedquery);
}
+ /* Prepare AM-specific DSM for parallel table vacuum */
+ if (nworkers_for_table > 0)
+ table_parallel_vacuum_initialize(rel, pcxt, nworkers_for_table, state);
+
/* Success -- return parallel vacuum state */
return pvs;
}
@@ -534,33 +578,48 @@ parallel_vacuum_cleanup_all_indexes(ParallelVacuumState *pvs, long num_table_tup
}
/*
- * Compute the number of parallel worker processes to request. Both index
- * vacuum and index cleanup can be executed with parallel workers.
- * The index is eligible for parallel vacuum iff its size is greater than
- * min_parallel_index_scan_size as invoking workers for very small indexes
- * can hurt performance.
+ * Compute the number of parallel worker processes to request for table
+ * vacuum and index vacuum/cleanup.
+ *
+ * For parallel table vacuum, we ask the AM-specific routine to compute the
+ * number of parallel worker processes. The result is set to *nworkers_for_table.
*
- * nrequested is the number of parallel workers that user requested. If
- * nrequested is 0, we compute the parallel degree based on nindexes, that is
- * the number of indexes that support parallel vacuum. This function also
- * sets will_parallel_vacuum to remember indexes that participate in parallel
- * vacuum.
+ * For parallel index vacuum, an index is eligible for parallel vacuum iff
+ * its size is greater than min_parallel_index_scan_size, as invoking workers
+ * for very small indexes can hurt performance. nrequested is the number of
+ * parallel workers that the user requested. If nrequested is 0, we compute the
+ * parallel degree based on nindexes, that is the number of indexes that
+ * support parallel vacuum. This function also sets will_parallel_vacuum to
+ * remember indexes that participate in parallel vacuum.
*/
-static int
-parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested,
- bool *will_parallel_vacuum)
+static void
+parallel_vacuum_compute_workers(Relation rel, Relation *indrels, int nindexes,
+ int nrequested, int *nworkers_for_table,
+ int *nworkers_for_index, bool *will_parallel_vacuum)
{
int nindexes_parallel = 0;
int nindexes_parallel_bulkdel = 0;
int nindexes_parallel_cleanup = 0;
- int parallel_workers;
+ int parallel_workers_table = 0;
+ int parallel_workers_index = 0;
/*
* We don't allow performing parallel operation in standalone backend or
* when parallelism is disabled.
*/
if (!IsUnderPostmaster || max_parallel_maintenance_workers == 0)
- return 0;
+ {
+ *nworkers_for_table = 0;
+ *nworkers_for_index = 0;
+ return;
+ }
+
+ /*
+ * Compute the number of workers for parallel table scan. Cap by
+ * max_parallel_maintenance_workers.
+ */
+ parallel_workers_table = Min(table_parallel_vacuum_compute_workers(rel, nrequested),
+ max_parallel_maintenance_workers);
/*
* Compute the number of indexes that can participate in parallel vacuum.
@@ -591,17 +650,18 @@ parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested,
nindexes_parallel--;
/* No index supports parallel vacuum */
- if (nindexes_parallel <= 0)
- return 0;
-
- /* Compute the parallel degree */
- parallel_workers = (nrequested > 0) ?
- Min(nrequested, nindexes_parallel) : nindexes_parallel;
+ if (nindexes_parallel > 0)
+ {
+ /* Compute the parallel degree for parallel index vacuum */
+ parallel_workers_index = (nrequested > 0) ?
+ Min(nrequested, nindexes_parallel) : nindexes_parallel;
- /* Cap by max_parallel_maintenance_workers */
- parallel_workers = Min(parallel_workers, max_parallel_maintenance_workers);
+ /* Cap by max_parallel_maintenance_workers */
+ parallel_workers_index = Min(parallel_workers_index, max_parallel_maintenance_workers);
+ }
- return parallel_workers;
+ *nworkers_for_table = parallel_workers_table;
+ *nworkers_for_index = parallel_workers_index;
}
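/*
 * Editorial example: with nrequested = 0, a table large enough for 2 scan
 * workers and 4 parallel-safe indexes (the leader participates, so one is
 * subtracted) gives nworkers_for_table = 2 and nworkers_for_index = 3. The
 * caller then sizes the parallel context for Max(2, 3) = 3 workers, and each
 * phase relaunches only as many workers as it needs via
 * ReinitializeParallelWorkers().
 */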
/*
@@ -669,8 +729,12 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, bool vacuum)
/* Setup the shared cost-based vacuum delay and launch workers */
if (nworkers > 0)
{
- /* Reinitialize parallel context to relaunch parallel workers */
- if (pvs->num_index_scans > 0)
+ /*
+ * Reinitialize parallel context to relaunch parallel workers if we
+ * have used the parallel context for either index vacuuming or table
+ * vacuuming.
+ */
+ if (pvs->num_index_scans > 0 || pvs->num_table_scans > 0)
ReinitializeParallelDSM(pvs->pcxt);
/*
@@ -982,6 +1046,146 @@ parallel_vacuum_index_is_parallel_safe(Relation indrel, int num_index_scans,
return true;
}
+/*
+ * Prepare DSM and shared vacuum delays, and launch parallel workers for parallel
+ * table vacuum. Return the number of parallel workers launched.
+ *
+ * The caller must call parallel_vacuum_table_scan_end() to finish the parallel
+ * table vacuum.
+ */
+int
+parallel_vacuum_table_scan_begin(ParallelVacuumState *pvs)
+{
+ Assert(!IsParallelWorker());
+
+ if (pvs->shared->nworkers_for_table == 0)
+ return 0;
+
+ pg_atomic_write_u32(&(pvs->shared->cost_balance), VacuumCostBalance);
+ pg_atomic_write_u32(&(pvs->shared->active_nworkers), 0);
+
+ pvs->shared->do_vacuum_table_scan = true;
+
+ if (pvs->num_table_scans > 0)
+ ReinitializeParallelDSM(pvs->pcxt);
+
+ /*
+ * The number of workers might vary between table vacuum and index
+ * processing.
+ */
+ ReinitializeParallelWorkers(pvs->pcxt, pvs->shared->nworkers_for_table);
+ LaunchParallelWorkers(pvs->pcxt);
+
+ if (pvs->pcxt->nworkers_launched > 0)
+ {
+ /*
+ * Reset the local cost values for leader backend as we have already
+ * accumulated the remaining balance of heap.
+ */
+ VacuumCostBalance = 0;
+ VacuumCostBalanceLocal = 0;
+
+ /* Enable shared cost balance for leader backend */
+ VacuumSharedCostBalance = &(pvs->shared->cost_balance);
+ VacuumActiveNWorkers = &(pvs->shared->active_nworkers);
+
+ /* Include the worker count for the leader itself */
+ pg_atomic_add_fetch_u32(VacuumActiveNWorkers, 1);
+ }
+
+ ereport(pvs->shared->elevel,
+ (errmsg(ngettext("launched %d parallel vacuum worker for table processing (planned: %d)",
+ "launched %d parallel vacuum workers for table processing (planned: %d)",
+ pvs->pcxt->nworkers_launched),
+ pvs->pcxt->nworkers_launched, pvs->shared->nworkers_for_table)));
+
+ return pvs->pcxt->nworkers_launched;
+}
+
+/*
+ * Wait for all workers for parallel table vacuum scan, and gather statistics.
+ */
+void
+parallel_vacuum_table_scan_end(ParallelVacuumState *pvs)
+{
+ Assert(!IsParallelWorker());
+
+ if (pvs->shared->nworkers_for_table == 0)
+ return;
+
+ WaitForParallelWorkersToFinish(pvs->pcxt);
+
+ /* Decrement the worker count for the leader itself */
+ pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1);
+
+ for (int i = 0; i < pvs->pcxt->nworkers_launched; i++)
+ InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]);
+
+ /*
+ * Carry the shared balance value to heap scan and disable shared costing
+ */
+ if (VacuumSharedCostBalance)
+ {
+ VacuumCostBalance = pg_atomic_read_u32(VacuumSharedCostBalance);
+ VacuumSharedCostBalance = NULL;
+ VacuumActiveNWorkers = NULL;
+ }
+
+ pvs->shared->do_vacuum_table_scan = false;
+ pvs->num_table_scans++;
+}
+
+/*
+ * Return the array of indexes associated with the given table to be vacuumed.
+ */
+Relation *
+parallel_vacuum_get_table_indexes(ParallelVacuumState *pvs, int *nindexes)
+{
+ *nindexes = pvs->nindexes;
+
+ return pvs->indrels;
+}
+
+/*
+ * Return the number of workers for parallel table vacuum.
+ */
+int
+parallel_vacuum_get_nworkers_table(ParallelVacuumState *pvs)
+{
+ return pvs->shared->nworkers_for_table;
+}
+
+/*
+ * Return the number of workers for parallel index processing.
+ */
+int
+parallel_vacuum_get_nworkers_index(ParallelVacuumState *pvs)
+{
+ return pvs->shared->nworkers_for_index;
+}
+
+/*
+ * A parallel worker invokes the table-AM-specific vacuum scan callback.
+ */
+static void
+parallel_vacuum_process_table(ParallelVacuumState *pvs)
+{
+ Assert(VacuumActiveNWorkers);
+ Assert(pvs->shared->do_vacuum_table_scan);
+
+ /* Increment the active worker before starting the table vacuum */
+ pg_atomic_add_fetch_u32(VacuumActiveNWorkers, 1);
+
+ /* Do table vacuum scan */
+ table_parallel_vacuum_relation_worker(pvs->heaprel, pvs, pvs->pwcxt);
+
+ /*
+ * We have completed the table vacuum so decrement the active worker
+ * count.
+ */
+ pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1);
+}
+
/*
* Perform work within a launched parallel process.
*
@@ -1033,7 +1237,6 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
* matched to the leader's one.
*/
vac_open_indexes(rel, RowExclusiveLock, &nindexes, &indrels);
- Assert(nindexes > 0);
if (shared->maintenance_work_mem_worker > 0)
maintenance_work_mem = shared->maintenance_work_mem_worker;
@@ -1064,6 +1267,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
pvs.relname = pstrdup(RelationGetRelationName(rel));
pvs.heaprel = rel;
+ pvs.pwcxt = palloc(sizeof(ParallelWorkerContext));
+ pvs.pwcxt->toc = toc;
+ pvs.pwcxt->seg = seg;
+
/* These fields will be filled during index vacuum or cleanup */
pvs.indname = NULL;
pvs.status = PARALLEL_INDVAC_STATUS_INITIAL;
@@ -1081,8 +1288,16 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Prepare to track buffer usage during parallel execution */
InstrStartParallelQuery();
- /* Process indexes to perform vacuum/cleanup */
- parallel_vacuum_process_safe_indexes(&pvs);
+ if (pvs.shared->do_vacuum_table_scan)
+ {
+ /* Process table to perform vacuum */
+ parallel_vacuum_process_table(&pvs);
+ }
+ else
+ {
+ /* Process indexes to perform vacuum/cleanup */
+ parallel_vacuum_process_safe_indexes(&pvs);
+ }
/* Report buffer/WAL usage during parallel execution */
buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index c769b1aa3ef..c408183425a 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -99,80 +99,6 @@ typedef struct ProcArrayStruct
int pgprocnos[FLEXIBLE_ARRAY_MEMBER];
} ProcArrayStruct;
-/*
- * State for the GlobalVisTest* family of functions. Those functions can
- * e.g. be used to decide if a deleted row can be removed without violating
- * MVCC semantics: If the deleted row's xmax is not considered to be running
- * by anyone, the row can be removed.
- *
- * To avoid slowing down GetSnapshotData(), we don't calculate a precise
- * cutoff XID while building a snapshot (looking at the frequently changing
- * xmins scales badly). Instead we compute two boundaries while building the
- * snapshot:
- *
- * 1) definitely_needed, indicating that rows deleted by XIDs >=
- * definitely_needed are definitely still visible.
- *
- * 2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can
- * definitely be removed
- *
- * When testing an XID that falls in between the two (i.e. XID >= maybe_needed
- * && XID < definitely_needed), the boundaries can be recomputed (using
- * ComputeXidHorizons()) to get a more accurate answer. This is cheaper than
- * maintaining an accurate value all the time.
- *
- * As it is not cheap to compute accurate boundaries, we limit the number of
- * times that happens in short succession. See GlobalVisTestShouldUpdate().
- *
- *
- * There are three backend lifetime instances of this struct, optimized for
- * different types of relations. As e.g. a normal user defined table in one
- * database is inaccessible to backends connected to another database, a test
- * specific to a relation can be more aggressive than a test for a shared
- * relation. Currently we track four different states:
- *
- * 1) GlobalVisSharedRels, which only considers an XID's
- * effects visible-to-everyone if neither snapshots in any database, nor a
- * replication slot's xmin, nor a replication slot's catalog_xmin might
- * still consider XID as running.
- *
- * 2) GlobalVisCatalogRels, which only considers an XID's
- * effects visible-to-everyone if neither snapshots in the current
- * database, nor a replication slot's xmin, nor a replication slot's
- * catalog_xmin might still consider XID as running.
- *
- * I.e. the difference to GlobalVisSharedRels is that
- * snapshot in other databases are ignored.
- *
- * 3) GlobalVisDataRels, which only considers an XID's
- * effects visible-to-everyone if neither snapshots in the current
- * database, nor a replication slot's xmin consider XID as running.
- *
- * I.e. the difference to GlobalVisCatalogRels is that
- * replication slot's catalog_xmin is not taken into account.
- *
- * 4) GlobalVisTempRels, which only considers the current session, as temp
- * tables are not visible to other sessions.
- *
- * GlobalVisTestFor(relation) returns the appropriate state
- * for the relation.
- *
- * The boundaries are FullTransactionIds instead of TransactionIds to avoid
- * wraparound dangers. There e.g. would otherwise exist no procarray state to
- * prevent maybe_needed to become old enough after the GetSnapshotData()
- * call.
- *
- * The typedef is in the header.
- */
-struct GlobalVisState
-{
- /* XIDs >= are considered running by some backend */
- FullTransactionId definitely_needed;
-
- /* XIDs < are not considered to be running by any backend */
- FullTransactionId maybe_needed;
-};
-
/*
* Result of ComputeXidHorizons().
*/
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 04afb1a6a66..740b69d35ef 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -21,6 +21,7 @@
#include "access/skey.h"
#include "access/table.h" /* for backward compatibility */
#include "access/tableam.h"
+#include "commands/vacuum.h"
#include "nodes/lockoptions.h"
#include "nodes/primnodes.h"
#include "storage/bufpage.h"
@@ -401,6 +402,13 @@ extern void log_heap_prune_and_freeze(Relation relation, Buffer buffer,
struct VacuumParams;
extern void heap_vacuum_rel(Relation rel,
struct VacuumParams *params, BufferAccessStrategy bstrategy);
+extern int heap_parallel_vacuum_compute_workers(Relation rel, int requested);
+extern void heap_parallel_vacuum_estimate(Relation rel, ParallelContext *pcxt,
+ int nworkers, void *state);
+extern void heap_parallel_vacuum_initialize(Relation rel, ParallelContext *pcxt,
+ int nworkers, void *state);
+extern void heap_parallel_vacuum_worker(Relation rel, ParallelVacuumState *pvs,
+ ParallelWorkerContext *pwcxt);
/* in heap/heapam_visibility.c */
extern bool HeapTupleSatisfiesVisibility(HeapTuple htup, Snapshot snapshot,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index bb32de11ea0..c4f516dda14 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -20,6 +20,7 @@
#include "access/relscan.h"
#include "access/sdir.h"
#include "access/xact.h"
+#include "commands/vacuum.h"
#include "executor/tuptable.h"
#include "storage/read_stream.h"
#include "utils/rel.h"
@@ -654,6 +655,47 @@ typedef struct TableAmRoutine
struct VacuumParams *params,
BufferAccessStrategy bstrategy);
+ /* ------------------------------------------------------------------------
+ * Callbacks for parallel table vacuum.
+ * ------------------------------------------------------------------------
+ */
+
+ /*
+ * Compute the number of parallel workers for parallel table vacuum. The
+ * function must return 0 to disable parallel table vacuum.
+ */
+ int (*parallel_vacuum_compute_workers) (Relation rel, int requested);
+
+ /*
+ * Estimate the size of shared memory that the parallel table vacuum needs
+ * for the table AM.
+ *
+ * Not called if parallel table vacuum is disabled.
+ */
+ void (*parallel_vacuum_estimate) (Relation rel,
+ ParallelContext *pcxt,
+ int nworkers,
+ void *state);
+
+ /*
+ * Initialize DSM space for parallel table vacuum.
+ *
+ * Not called if parallel table vacuum is disabled.
+ */
+ void (*parallel_vacuum_initialize) (Relation rel,
+ ParallelContext *pctx,
+ int nworkers,
+ void *state);
+
+ /*
+ * This callback is called for parallel table vacuum workers.
+ *
+ * Not called if parallel table vacuum is disabled.
+ */
+ void (*parallel_vacuum_relation_worker) (Relation rel,
+ ParallelVacuumState *pvs,
+ ParallelWorkerContext *pwcxt);
+
/*
* Prepare to analyze block `blockno` of `scan`. The scan has been started
* with table_beginscan_analyze(). See also
@@ -1715,6 +1757,52 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
rel->rd_tableam->relation_vacuum(rel, params, bstrategy);
}
+/* ----------------------------------------------------------------------------
+ * Parallel vacuum related functions.
+ * ----------------------------------------------------------------------------
+ */
+
+/*
+ * Return the number of parallel workers for a parallel vacuum scan of this
+ * relation.
+ */
+static inline int
+table_parallel_vacuum_compute_workers(Relation rel, int requested)
+{
+ return rel->rd_tableam->parallel_vacuum_compute_workers(rel, requested);
+}
+
+/*
+ * Estimate the size of shared memory needed for a parallel vacuum scan of
+ * this relation.
+ */
+static inline void
+table_parallel_vacuum_estimate(Relation rel, ParallelContext *pcxt, int nworkers,
+ void *state)
+{
+ rel->rd_tableam->parallel_vacuum_estimate(rel, pcxt, nworkers, state);
+}
+
+/*
+ * Initialize shared memory area for a parallel vacuum scan of this relation.
+ */
+static inline void
+table_parallel_vacuum_initialize(Relation rel, ParallelContext *pcxt, int nworkers,
+ void *state)
+{
+ rel->rd_tableam->parallel_vacuum_initialize(rel, pcxt, nworkers, state);
+}
+
+/*
+ * Start parallel table vacuuming for this relation.
+ */
+static inline void
+table_parallel_vacuum_relation_worker(Relation rel, ParallelVacuumState *pvs,
+ ParallelWorkerContext *pwcxt)
+{
+ rel->rd_tableam->parallel_vacuum_relation_worker(rel, pvs, pwcxt);
+}
+
/*
* Prepare to analyze the next block in the read stream. The scan needs to
* have been started with table_beginscan_analyze(). Note that this routine
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 7613d00e26f..b70e50133fa 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -360,7 +360,8 @@ extern void VacuumUpdateCosts(void);
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
int vac_work_mem, int elevel,
- BufferAccessStrategy bstrategy);
+ BufferAccessStrategy bstrategy,
+ void *state);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs,
VacDeadItemsInfo **dead_items_info_p);
@@ -370,6 +371,11 @@ extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
extern void parallel_vacuum_cleanup_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
bool estimated_count);
+extern int parallel_vacuum_table_scan_begin(ParallelVacuumState *pvs);
+extern void parallel_vacuum_table_scan_end(ParallelVacuumState *pvs);
+extern int parallel_vacuum_get_nworkers_table(ParallelVacuumState *pvs);
+extern int parallel_vacuum_get_nworkers_index(ParallelVacuumState *pvs);
+extern Relation *parallel_vacuum_get_table_indexes(ParallelVacuumState *pvs, int *nindexes);
extern void parallel_vacuum_main(dsm_segment *seg, shm_toc *toc);
/* in commands/analyze.c */
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index afc284e9c36..c9d7a39d605 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -17,6 +17,7 @@
#include "utils/relcache.h"
#include "utils/resowner.h"
#include "utils/snapshot.h"
+#include "utils/snapmgr_internal.h"
extern PGDLLIMPORT bool FirstSnapshotSet;
@@ -96,7 +97,6 @@ extern char *ExportSnapshot(Snapshot snapshot);
* These live in procarray.c because they're intimately linked to the
* procarray contents, but thematically they better fit into snapmgr.h.
*/
-typedef struct GlobalVisState GlobalVisState;
extern GlobalVisState *GlobalVisTestFor(Relation rel);
extern bool GlobalVisTestIsRemovableXid(GlobalVisState *state, TransactionId xid);
extern bool GlobalVisTestIsRemovableFullXid(GlobalVisState *state, FullTransactionId fxid);
diff --git a/src/include/utils/snapmgr_internal.h b/src/include/utils/snapmgr_internal.h
new file mode 100644
index 00000000000..241121872b7
--- /dev/null
+++ b/src/include/utils/snapmgr_internal.h
@@ -0,0 +1,89 @@
+/*-------------------------------------------------------------------------
+ *
+ * snapmgr_internal.h
+ *   Struct declarations for internal use by the snapshot manager.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/snapmgr_internal.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SNAPMGR_INTERNAL_H
+#define SNAPMGR_INTERNAL_H
+
+/*
+ * State for the GlobalVisTest* family of functions. Those functions can
+ * e.g. be used to decide if a deleted row can be removed without violating
+ * MVCC semantics: If the deleted row's xmax is not considered to be running
+ * by anyone, the row can be removed.
+ *
+ * To avoid slowing down GetSnapshotData(), we don't calculate a precise
+ * cutoff XID while building a snapshot (looking at the frequently changing
+ * xmins scales badly). Instead we compute two boundaries while building the
+ * snapshot:
+ *
+ * 1) definitely_needed, indicating that rows deleted by XIDs >=
+ * definitely_needed are definitely still visible.
+ *
+ * 2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can
+ * definitely be removed
+ *
+ * When testing an XID that falls in between the two (i.e. XID >= maybe_needed
+ * && XID < definitely_needed), the boundaries can be recomputed (using
+ * ComputeXidHorizons()) to get a more accurate answer. This is cheaper than
+ * maintaining an accurate value all the time.
+ *
+ * As it is not cheap to compute accurate boundaries, we limit the number of
+ * times that happens in short succession. See GlobalVisTestShouldUpdate().
+ *
+ *
+ * There are three backend lifetime instances of this struct, optimized for
+ * different types of relations. As e.g. a normal user defined table in one
+ * database is inaccessible to backends connected to another database, a test
+ * specific to a relation can be more aggressive than a test for a shared
+ * relation. Currently we track four different states:
+ *
+ * 1) GlobalVisSharedRels, which only considers an XID's
+ * effects visible-to-everyone if neither snapshots in any database, nor a
+ * replication slot's xmin, nor a replication slot's catalog_xmin might
+ * still consider XID as running.
+ *
+ * 2) GlobalVisCatalogRels, which only considers an XID's
+ * effects visible-to-everyone if neither snapshots in the current
+ * database, nor a replication slot's xmin, nor a replication slot's
+ * catalog_xmin might still consider XID as running.
+ *
+ * I.e. the difference to GlobalVisSharedRels is that
+ * snapshot in other databases are ignored.
+ *
+ * 3) GlobalVisDataRels, which only considers an XID's
+ * effects visible-to-everyone if neither snapshots in the current
+ * database, nor a replication slot's xmin consider XID as running.
+ *
+ * I.e. the difference to GlobalVisCatalogRels is that
+ * replication slot's catalog_xmin is not taken into account.
+ *
+ * 4) GlobalVisTempRels, which only considers the current session, as temp
+ * tables are not visible to other sessions.
+ *
+ * GlobalVisTestFor(relation) returns the appropriate state
+ * for the relation.
+ *
+ * The boundaries are FullTransactionIds instead of TransactionIds to avoid
+ * wraparound dangers. There e.g. would otherwise exist no procarray state to
+ * prevent maybe_needed to become old enough after the GetSnapshotData()
+ * call.
+ */
+typedef struct GlobalVisState
+{
+ /* XIDs >= are considered running by some backend */
+ FullTransactionId definitely_needed;
+
+ /* XIDs < are not considered to be running by any backend */
+ FullTransactionId maybe_needed;
+} GlobalVisState;
+
+#endif /* SNAPMGR_INTERNAL_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c4e0477c0d4..a0a0c9faadf 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1841,6 +1841,9 @@ PGresAttValue
PGresParamDesc
PGresult
PGresult_data
+PHVScanWorkerState
+PHVShared
+PHVState
PIO_STATUS_BLOCK
PLAINTREE
PLAssignStmt
--
2.43.5
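For orientation, here is a minimal sketch (not an excerpt from the patch) of how the heap AM could wire the proposed callbacks into its TableAmRoutine, using the functions the patch declares in heapam.h:
```
#include "access/heapam.h"
#include "access/tableam.h"

/*
 * Sketch only: fill in the proposed parallel table vacuum callbacks with
 * the heapam functions declared above; all other members are elided.
 */
static const TableAmRoutine heapam_methods = {
    .type = T_TableAmRoutine,

    /* ... existing callbacks elided ... */
    .relation_vacuum = heap_vacuum_rel,

    /* proposed callbacks for parallel table vacuum */
    .parallel_vacuum_compute_workers = heap_parallel_vacuum_compute_workers,
    .parallel_vacuum_estimate = heap_parallel_vacuum_estimate,
    .parallel_vacuum_initialize = heap_parallel_vacuum_initialize,
    .parallel_vacuum_relation_worker = heap_parallel_vacuum_worker,
};
```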
Attachment: v5-0001-Move-lazy-heap-scanning-related-variables-to-stru.patch
From 56a15d51dab3fdfc0f9b0e902a1bff2b60551b30 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 15 Nov 2024 14:14:13 -0800
Subject: [PATCH v5 1/8] Move lazy heap scanning related variables to struct
LVRelScanState.
---
src/backend/access/heap/vacuumlazy.c | 300 ++++++++++++++-------------
src/tools/pgindent/typedefs.list | 1 +
2 files changed, 157 insertions(+), 144 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f2ca9430581..05406a0bc5a 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -131,6 +131,47 @@ typedef enum
VACUUM_ERRCB_PHASE_TRUNCATE,
} VacErrPhase;
+/*
+ * Relation statistics collected during heap scanning.
+ */
+typedef struct LVRelScanState
+{
+ BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
+ BlockNumber removed_pages; /* # pages removed by relation truncation */
+ BlockNumber new_frozen_tuple_pages; /* # pages with newly frozen tuples */
+
+ /* # pages newly set all-visible in the VM */
+ BlockNumber vm_new_visible_pages;
+
+ /*
+ * # pages newly set all-visible and all-frozen in the VM. This is a
+ * subset of vm_new_visible_pages. That is, vm_new_visible_pages includes
+ * all pages set all-visible, but vm_new_visible_frozen_pages includes
+ * only those which were also set all-frozen.
+ */
+ BlockNumber vm_new_visible_frozen_pages;
+
+ /* # all-visible pages newly set all-frozen in the VM */
+ BlockNumber vm_new_frozen_pages;
+
+ BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
+ BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
+ BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
+
+ /* Counters that follow are only for scanned_pages */
+ int64 tuples_deleted; /* # deleted from table */
+ int64 tuples_frozen; /* # newly frozen */
+ int64 lpdead_items; /* # deleted from indexes */
+ int64 live_tuples; /* # live tuples remaining */
+ int64 recently_dead_tuples; /* # dead, but not yet removable */
+ int64 missed_dead_tuples; /* # removable, but not removed */
+
+ /* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid. */
+ TransactionId NewRelfrozenXid;
+ MultiXactId NewRelminMxid;
+ bool skippedallvis;
+} LVRelScanState;
+
typedef struct LVRelState
{
/* Target heap relation and its indexes */
@@ -157,10 +198,6 @@ typedef struct LVRelState
/* VACUUM operation's cutoffs for freezing and pruning */
struct VacuumCutoffs cutoffs;
GlobalVisState *vistest;
- /* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
- TransactionId NewRelfrozenXid;
- MultiXactId NewRelminMxid;
- bool skippedallvis;
/* Error reporting state */
char *dbname;
@@ -186,43 +223,18 @@ typedef struct LVRelState
VacDeadItemsInfo *dead_items_info;
BlockNumber rel_pages; /* total number of pages */
- BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
- BlockNumber removed_pages; /* # pages removed by relation truncation */
- BlockNumber new_frozen_tuple_pages; /* # pages with newly frozen tuples */
-
- /* # pages newly set all-visible in the VM */
- BlockNumber vm_new_visible_pages;
-
- /*
- * # pages newly set all-visible and all-frozen in the VM. This is a
- * subset of vm_new_visible_pages. That is, vm_new_visible_pages includes
- * all pages set all-visible, but vm_new_visible_frozen_pages includes
- * only those which were also set all-frozen.
- */
- BlockNumber vm_new_visible_frozen_pages;
- /* # all-visible pages newly set all-frozen in the VM */
- BlockNumber vm_new_frozen_pages;
+ /* Working state for heap scanning and vacuuming */
+ LVRelScanState *scan_state;
- BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
- BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
- BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
-
- /* Statistics output by us, for table */
- double new_rel_tuples; /* new estimated total # of tuples */
- double new_live_tuples; /* new estimated total # of live tuples */
+ /* New estimated total # of tuples and total # of live tuples */
+ double new_rel_tuples;
+ double new_live_tuples;
/* Statistics output by index AMs */
IndexBulkDeleteResult **indstats;
/* Instrumentation counters */
int num_index_scans;
- /* Counters that follow are only for scanned_pages */
- int64 tuples_deleted; /* # deleted from table */
- int64 tuples_frozen; /* # newly frozen */
- int64 lpdead_items; /* # deleted from indexes */
- int64 live_tuples; /* # live tuples remaining */
- int64 recently_dead_tuples; /* # dead, but not yet removable */
- int64 missed_dead_tuples; /* # removable, but not removed */
/* State maintained by heap_vac_scan_next_block() */
BlockNumber current_block; /* last block returned */
@@ -309,6 +321,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
BufferAccessStrategy bstrategy)
{
LVRelState *vacrel;
+ LVRelScanState *scan_state;
bool verbose,
instrument,
skipwithvm,
@@ -420,12 +433,23 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
}
/* Initialize page counters explicitly (be tidy) */
- vacrel->scanned_pages = 0;
- vacrel->removed_pages = 0;
- vacrel->new_frozen_tuple_pages = 0;
- vacrel->lpdead_item_pages = 0;
- vacrel->missed_dead_pages = 0;
- vacrel->nonempty_pages = 0;
+ scan_state = palloc(sizeof(LVRelScanState));
+ scan_state->scanned_pages = 0;
+ scan_state->removed_pages = 0;
+ scan_state->new_frozen_tuple_pages = 0;
+ scan_state->lpdead_item_pages = 0;
+ scan_state->missed_dead_pages = 0;
+ scan_state->nonempty_pages = 0;
+ scan_state->tuples_deleted = 0;
+ scan_state->tuples_frozen = 0;
+ scan_state->lpdead_items = 0;
+ scan_state->live_tuples = 0;
+ scan_state->recently_dead_tuples = 0;
+ scan_state->missed_dead_tuples = 0;
+ scan_state->vm_new_visible_pages = 0;
+ scan_state->vm_new_visible_frozen_pages = 0;
+ scan_state->vm_new_frozen_pages = 0;
+ vacrel->scan_state = scan_state;
/* dead_items_alloc allocates vacrel->dead_items later on */
/* Allocate/initialize output statistics state */
@@ -434,19 +458,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->indstats = (IndexBulkDeleteResult **)
palloc0(vacrel->nindexes * sizeof(IndexBulkDeleteResult *));
- /* Initialize remaining counters (be tidy) */
- vacrel->num_index_scans = 0;
- vacrel->tuples_deleted = 0;
- vacrel->tuples_frozen = 0;
- vacrel->lpdead_items = 0;
- vacrel->live_tuples = 0;
- vacrel->recently_dead_tuples = 0;
- vacrel->missed_dead_tuples = 0;
-
- vacrel->vm_new_visible_pages = 0;
- vacrel->vm_new_visible_frozen_pages = 0;
- vacrel->vm_new_frozen_pages = 0;
-
/*
* Get cutoffs that determine which deleted tuples are considered DEAD,
* not just RECENTLY_DEAD, and which XIDs/MXIDs to freeze. Then determine
@@ -467,9 +478,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
vacrel->vistest = GlobalVisTestFor(rel);
/* Initialize state used to track oldest extant XID/MXID */
- vacrel->NewRelfrozenXid = vacrel->cutoffs.OldestXmin;
- vacrel->NewRelminMxid = vacrel->cutoffs.OldestMxact;
- vacrel->skippedallvis = false;
+ vacrel->scan_state->NewRelfrozenXid = vacrel->cutoffs.OldestXmin;
+ vacrel->scan_state->NewRelminMxid = vacrel->cutoffs.OldestMxact;
+ vacrel->scan_state->skippedallvis = false;
skipwithvm = true;
if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
{
@@ -550,15 +561,15 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* value >= FreezeLimit, and relminmxid to a value >= MultiXactCutoff.
* Non-aggressive VACUUMs may advance them by any amount, or not at all.
*/
- Assert(vacrel->NewRelfrozenXid == vacrel->cutoffs.OldestXmin ||
+ Assert(vacrel->scan_state->NewRelfrozenXid == vacrel->cutoffs.OldestXmin ||
TransactionIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.FreezeLimit :
vacrel->cutoffs.relfrozenxid,
- vacrel->NewRelfrozenXid));
- Assert(vacrel->NewRelminMxid == vacrel->cutoffs.OldestMxact ||
+ vacrel->scan_state->NewRelfrozenXid));
+ Assert(vacrel->scan_state->NewRelminMxid == vacrel->cutoffs.OldestMxact ||
MultiXactIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.MultiXactCutoff :
vacrel->cutoffs.relminmxid,
- vacrel->NewRelminMxid));
- if (vacrel->skippedallvis)
+ vacrel->scan_state->NewRelminMxid));
+ if (vacrel->scan_state->skippedallvis)
{
/*
* Must keep original relfrozenxid in a non-aggressive VACUUM that
@@ -566,8 +577,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* values will have missed unfrozen XIDs from the pages we skipped.
*/
Assert(!vacrel->aggressive);
- vacrel->NewRelfrozenXid = InvalidTransactionId;
- vacrel->NewRelminMxid = InvalidMultiXactId;
+ vacrel->scan_state->NewRelfrozenXid = InvalidTransactionId;
+ vacrel->scan_state->NewRelminMxid = InvalidMultiXactId;
}
/*
@@ -588,7 +599,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
*/
vac_update_relstats(rel, new_rel_pages, vacrel->new_live_tuples,
new_rel_allvisible, vacrel->nindexes > 0,
- vacrel->NewRelfrozenXid, vacrel->NewRelminMxid,
+ vacrel->scan_state->NewRelfrozenXid, vacrel->scan_state->NewRelminMxid,
&frozenxid_updated, &minmulti_updated, false);
/*
@@ -604,8 +615,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(rel),
rel->rd_rel->relisshared,
Max(vacrel->new_live_tuples, 0),
- vacrel->recently_dead_tuples +
- vacrel->missed_dead_tuples);
+ vacrel->scan_state->recently_dead_tuples +
+ vacrel->scan_state->missed_dead_tuples);
pgstat_progress_end_command();
if (instrument)
@@ -678,21 +689,21 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->relname,
vacrel->num_index_scans);
appendStringInfo(&buf, _("pages: %u removed, %u remain, %u scanned (%.2f%% of total)\n"),
- vacrel->removed_pages,
+ vacrel->scan_state->removed_pages,
new_rel_pages,
- vacrel->scanned_pages,
+ vacrel->scan_state->scanned_pages,
orig_rel_pages == 0 ? 100.0 :
- 100.0 * vacrel->scanned_pages / orig_rel_pages);
+ 100.0 * vacrel->scan_state->scanned_pages / orig_rel_pages);
appendStringInfo(&buf,
_("tuples: %lld removed, %lld remain, %lld are dead but not yet removable\n"),
- (long long) vacrel->tuples_deleted,
+ (long long) vacrel->scan_state->tuples_deleted,
(long long) vacrel->new_rel_tuples,
- (long long) vacrel->recently_dead_tuples);
- if (vacrel->missed_dead_tuples > 0)
+ (long long) vacrel->scan_state->recently_dead_tuples);
+ if (vacrel->scan_state->missed_dead_tuples > 0)
appendStringInfo(&buf,
_("tuples missed: %lld dead from %u pages not removed due to cleanup lock contention\n"),
- (long long) vacrel->missed_dead_tuples,
- vacrel->missed_dead_pages);
+ (long long) vacrel->scan_state->missed_dead_tuples,
+ vacrel->scan_state->missed_dead_pages);
diff = (int32) (ReadNextTransactionId() -
vacrel->cutoffs.OldestXmin);
appendStringInfo(&buf,
@@ -700,33 +711,33 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->cutoffs.OldestXmin, diff);
if (frozenxid_updated)
{
- diff = (int32) (vacrel->NewRelfrozenXid -
+ diff = (int32) (vacrel->scan_state->NewRelfrozenXid -
vacrel->cutoffs.relfrozenxid);
appendStringInfo(&buf,
_("new relfrozenxid: %u, which is %d XIDs ahead of previous value\n"),
- vacrel->NewRelfrozenXid, diff);
+ vacrel->scan_state->NewRelfrozenXid, diff);
}
if (minmulti_updated)
{
- diff = (int32) (vacrel->NewRelminMxid -
+ diff = (int32) (vacrel->scan_state->NewRelminMxid -
vacrel->cutoffs.relminmxid);
appendStringInfo(&buf,
_("new relminmxid: %u, which is %d MXIDs ahead of previous value\n"),
- vacrel->NewRelminMxid, diff);
+ vacrel->scan_state->NewRelminMxid, diff);
}
appendStringInfo(&buf, _("frozen: %u pages from table (%.2f%% of total) had %lld tuples frozen\n"),
- vacrel->new_frozen_tuple_pages,
+ vacrel->scan_state->new_frozen_tuple_pages,
orig_rel_pages == 0 ? 100.0 :
- 100.0 * vacrel->new_frozen_tuple_pages /
+ 100.0 * vacrel->scan_state->new_frozen_tuple_pages /
orig_rel_pages,
- (long long) vacrel->tuples_frozen);
+ (long long) vacrel->scan_state->tuples_frozen);
appendStringInfo(&buf,
_("visibility map: %u pages set all-visible, %u pages set all-frozen (%u were all-visible)\n"),
- vacrel->vm_new_visible_pages,
- vacrel->vm_new_visible_frozen_pages +
- vacrel->vm_new_frozen_pages,
- vacrel->vm_new_frozen_pages);
+ vacrel->scan_state->vm_new_visible_pages,
+ vacrel->scan_state->vm_new_visible_frozen_pages +
+ vacrel->scan_state->vm_new_frozen_pages,
+ vacrel->scan_state->vm_new_frozen_pages);
if (vacrel->do_index_vacuuming)
{
if (vacrel->nindexes == 0 || vacrel->num_index_scans == 0)
@@ -746,10 +757,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
msgfmt = _("%u pages from table (%.2f%% of total) have %lld dead item identifiers\n");
}
appendStringInfo(&buf, msgfmt,
- vacrel->lpdead_item_pages,
+ vacrel->scan_state->lpdead_item_pages,
orig_rel_pages == 0 ? 100.0 :
- 100.0 * vacrel->lpdead_item_pages / orig_rel_pages,
- (long long) vacrel->lpdead_items);
+ 100.0 * vacrel->scan_state->lpdead_item_pages / orig_rel_pages,
+ (long long) vacrel->scan_state->lpdead_items);
for (int i = 0; i < vacrel->nindexes; i++)
{
IndexBulkDeleteResult *istat = vacrel->indstats[i];
@@ -882,7 +893,7 @@ lazy_scan_heap(LVRelState *vacrel)
bool has_lpdead_items;
bool got_cleanup_lock = false;
- vacrel->scanned_pages++;
+ vacrel->scan_state->scanned_pages++;
/* Report as block scanned, update error traceback information */
pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
@@ -900,7 +911,7 @@ lazy_scan_heap(LVRelState *vacrel)
* one-pass strategy, and the two-pass strategy with the index_cleanup
* param set to 'off'.
*/
- if (vacrel->scanned_pages % FAILSAFE_EVERY_PAGES == 0)
+ if (vacrel->scan_state->scanned_pages % FAILSAFE_EVERY_PAGES == 0)
lazy_check_wraparound_failsafe(vacrel);
/*
@@ -1064,16 +1075,16 @@ lazy_scan_heap(LVRelState *vacrel)
/* now we can compute the new value for pg_class.reltuples */
vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
- vacrel->scanned_pages,
- vacrel->live_tuples);
+ vacrel->scan_state->scanned_pages,
+ vacrel->scan_state->live_tuples);
/*
* Also compute the total number of surviving heap entries. In the
* (unlikely) scenario that new_live_tuples is -1, take it as zero.
*/
vacrel->new_rel_tuples =
- Max(vacrel->new_live_tuples, 0) + vacrel->recently_dead_tuples +
- vacrel->missed_dead_tuples;
+ Max(vacrel->new_live_tuples, 0) + vacrel->scan_state->recently_dead_tuples +
+ vacrel->scan_state->missed_dead_tuples;
/*
* Do index vacuuming (call each index's ambulkdelete routine), then do
@@ -1110,8 +1121,8 @@ lazy_scan_heap(LVRelState *vacrel)
* there are no further blocks to process.
*
* vacrel is an in/out parameter here. Vacuum options and information about
- * the relation are read. vacrel->skippedallvis is set if we skip a block
- * that's all-visible but not all-frozen, to ensure that we don't update
+ * the relation are read. vacrel->scan_state->skippedallvis is set if we skip
+ * a block that's all-visible but not all-frozen, to ensure that we don't update
* relfrozenxid in that case. vacrel also holds information about the next
* unskippable block, as bookkeeping for this function.
*/
@@ -1170,7 +1181,7 @@ heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
{
next_block = vacrel->next_unskippable_block;
if (skipsallvis)
- vacrel->skippedallvis = true;
+ vacrel->scan_state->skippedallvis = true;
}
}
@@ -1414,11 +1425,11 @@ lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
*/
if ((old_vmbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
{
- vacrel->vm_new_visible_pages++;
- vacrel->vm_new_visible_frozen_pages++;
+ vacrel->scan_state->vm_new_visible_pages++;
+ vacrel->scan_state->vm_new_visible_frozen_pages++;
}
else if ((old_vmbits & VISIBILITYMAP_ALL_FROZEN) == 0)
- vacrel->vm_new_frozen_pages++;
+ vacrel->scan_state->vm_new_frozen_pages++;
}
freespace = PageGetHeapFreeSpace(page);
@@ -1488,10 +1499,11 @@ lazy_scan_prune(LVRelState *vacrel,
heap_page_prune_and_freeze(rel, buf, vacrel->vistest, prune_options,
&vacrel->cutoffs, &presult, PRUNE_VACUUM_SCAN,
&vacrel->offnum,
- &vacrel->NewRelfrozenXid, &vacrel->NewRelminMxid);
+ &vacrel->scan_state->NewRelfrozenXid,
+ &vacrel->scan_state->NewRelminMxid);
- Assert(MultiXactIdIsValid(vacrel->NewRelminMxid));
- Assert(TransactionIdIsValid(vacrel->NewRelfrozenXid));
+ Assert(MultiXactIdIsValid(vacrel->scan_state->NewRelminMxid));
+ Assert(TransactionIdIsValid(vacrel->scan_state->NewRelfrozenXid));
if (presult.nfrozen > 0)
{
@@ -1501,7 +1513,7 @@ lazy_scan_prune(LVRelState *vacrel,
* frozen tuples (don't confuse that with pages newly set all-frozen
* in VM).
*/
- vacrel->new_frozen_tuple_pages++;
+ vacrel->scan_state->new_frozen_tuple_pages++;
}
/*
@@ -1536,7 +1548,7 @@ lazy_scan_prune(LVRelState *vacrel,
*/
if (presult.lpdead_items > 0)
{
- vacrel->lpdead_item_pages++;
+ vacrel->scan_state->lpdead_item_pages++;
/*
* deadoffsets are collected incrementally in
@@ -1551,15 +1563,15 @@ lazy_scan_prune(LVRelState *vacrel,
}
/* Finally, add page-local counts to whole-VACUUM counts */
- vacrel->tuples_deleted += presult.ndeleted;
- vacrel->tuples_frozen += presult.nfrozen;
- vacrel->lpdead_items += presult.lpdead_items;
- vacrel->live_tuples += presult.live_tuples;
- vacrel->recently_dead_tuples += presult.recently_dead_tuples;
+ vacrel->scan_state->tuples_deleted += presult.ndeleted;
+ vacrel->scan_state->tuples_frozen += presult.nfrozen;
+ vacrel->scan_state->lpdead_items += presult.lpdead_items;
+ vacrel->scan_state->live_tuples += presult.live_tuples;
+ vacrel->scan_state->recently_dead_tuples += presult.recently_dead_tuples;
/* Can't truncate this page */
if (presult.hastup)
- vacrel->nonempty_pages = blkno + 1;
+ vacrel->scan_state->nonempty_pages = blkno + 1;
/* Did we find LP_DEAD items? */
*has_lpdead_items = (presult.lpdead_items > 0);
@@ -1608,13 +1620,13 @@ lazy_scan_prune(LVRelState *vacrel,
*/
if ((old_vmbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
{
- vacrel->vm_new_visible_pages++;
+ vacrel->scan_state->vm_new_visible_pages++;
if (presult.all_frozen)
- vacrel->vm_new_visible_frozen_pages++;
+ vacrel->scan_state->vm_new_visible_frozen_pages++;
}
else if ((old_vmbits & VISIBILITYMAP_ALL_FROZEN) == 0 &&
presult.all_frozen)
- vacrel->vm_new_frozen_pages++;
+ vacrel->scan_state->vm_new_frozen_pages++;
}
/*
@@ -1700,8 +1712,8 @@ lazy_scan_prune(LVRelState *vacrel,
*/
if ((old_vmbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
{
- vacrel->vm_new_visible_pages++;
- vacrel->vm_new_visible_frozen_pages++;
+ vacrel->scan_state->vm_new_visible_pages++;
+ vacrel->scan_state->vm_new_visible_frozen_pages++;
}
/*
@@ -1709,7 +1721,7 @@ lazy_scan_prune(LVRelState *vacrel,
* above, so we don't need to test the value of old_vmbits.
*/
else
- vacrel->vm_new_frozen_pages++;
+ vacrel->scan_state->vm_new_frozen_pages++;
}
}
@@ -1748,8 +1760,8 @@ lazy_scan_noprune(LVRelState *vacrel,
missed_dead_tuples;
bool hastup;
HeapTupleHeader tupleheader;
- TransactionId NoFreezePageRelfrozenXid = vacrel->NewRelfrozenXid;
- MultiXactId NoFreezePageRelminMxid = vacrel->NewRelminMxid;
+ TransactionId NoFreezePageRelfrozenXid = vacrel->scan_state->NewRelfrozenXid;
+ MultiXactId NoFreezePageRelminMxid = vacrel->scan_state->NewRelminMxid;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1876,8 +1888,8 @@ lazy_scan_noprune(LVRelState *vacrel,
* this particular page until the next VACUUM. Remember its details now.
* (lazy_scan_prune expects a clean slate, so we have to do this last.)
*/
- vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
- vacrel->NewRelminMxid = NoFreezePageRelminMxid;
+ vacrel->scan_state->NewRelfrozenXid = NoFreezePageRelfrozenXid;
+ vacrel->scan_state->NewRelminMxid = NoFreezePageRelminMxid;
/* Save any LP_DEAD items found on the page in dead_items */
if (vacrel->nindexes == 0)
@@ -1904,25 +1916,25 @@ lazy_scan_noprune(LVRelState *vacrel,
* indexes will be deleted during index vacuuming (and then marked
* LP_UNUSED in the heap)
*/
- vacrel->lpdead_item_pages++;
+ vacrel->scan_state->lpdead_item_pages++;
dead_items_add(vacrel, blkno, deadoffsets, lpdead_items);
- vacrel->lpdead_items += lpdead_items;
+ vacrel->scan_state->lpdead_items += lpdead_items;
}
/*
* Finally, add relevant page-local counts to whole-VACUUM counts
*/
- vacrel->live_tuples += live_tuples;
- vacrel->recently_dead_tuples += recently_dead_tuples;
- vacrel->missed_dead_tuples += missed_dead_tuples;
+ vacrel->scan_state->live_tuples += live_tuples;
+ vacrel->scan_state->recently_dead_tuples += recently_dead_tuples;
+ vacrel->scan_state->missed_dead_tuples += missed_dead_tuples;
if (missed_dead_tuples > 0)
- vacrel->missed_dead_pages++;
+ vacrel->scan_state->missed_dead_pages++;
/* Can't truncate this page */
if (hastup)
- vacrel->nonempty_pages = blkno + 1;
+ vacrel->scan_state->nonempty_pages = blkno + 1;
/* Did we find LP_DEAD items? */
*has_lpdead_items = (lpdead_items > 0);
@@ -1951,7 +1963,7 @@ lazy_vacuum(LVRelState *vacrel)
/* Should not end up here with no indexes */
Assert(vacrel->nindexes > 0);
- Assert(vacrel->lpdead_item_pages > 0);
+ Assert(vacrel->scan_state->lpdead_item_pages > 0);
if (!vacrel->do_index_vacuuming)
{
@@ -1985,7 +1997,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items_info->num_items);
+ Assert(vacrel->scan_state->lpdead_items == vacrel->dead_items_info->num_items);
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2012,7 +2024,7 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
+ bypass = (vacrel->scan_state->lpdead_item_pages < threshold &&
(TidStoreMemoryUsage(vacrel->dead_items) < (32L * 1024L * 1024L)));
}
@@ -2150,7 +2162,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items_info->num_items == vacrel->lpdead_items);
+ vacrel->dead_items_info->num_items == vacrel->scan_state->lpdead_items);
Assert(allindexes || VacuumFailsafeActive);
/*
@@ -2259,8 +2271,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* the second heap pass. No more, no less.
*/
Assert(vacrel->num_index_scans > 1 ||
- (vacrel->dead_items_info->num_items == vacrel->lpdead_items &&
- vacuumed_pages == vacrel->lpdead_item_pages));
+ (vacrel->dead_items_info->num_items == vacrel->scan_state->lpdead_items &&
+ vacuumed_pages == vacrel->scan_state->lpdead_item_pages));
ereport(DEBUG2,
(errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
@@ -2376,14 +2388,14 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
*/
if ((old_vmbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
{
- vacrel->vm_new_visible_pages++;
+ vacrel->scan_state->vm_new_visible_pages++;
if (all_frozen)
- vacrel->vm_new_visible_frozen_pages++;
+ vacrel->scan_state->vm_new_visible_frozen_pages++;
}
else if ((old_vmbits & VISIBILITYMAP_ALL_FROZEN) == 0 &&
all_frozen)
- vacrel->vm_new_frozen_pages++;
+ vacrel->scan_state->vm_new_frozen_pages++;
}
/* Revert to the previous phase information for error traceback */
@@ -2459,7 +2471,7 @@ static void
lazy_cleanup_all_indexes(LVRelState *vacrel)
{
double reltuples = vacrel->new_rel_tuples;
- bool estimated_count = vacrel->scanned_pages < vacrel->rel_pages;
+ bool estimated_count = vacrel->scan_state->scanned_pages < vacrel->rel_pages;
const int progress_start_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_INDEXES_TOTAL
@@ -2640,7 +2652,7 @@ should_attempt_truncation(LVRelState *vacrel)
if (!vacrel->do_rel_truncate || VacuumFailsafeActive)
return false;
- possibly_freeable = vacrel->rel_pages - vacrel->nonempty_pages;
+ possibly_freeable = vacrel->rel_pages - vacrel->scan_state->nonempty_pages;
if (possibly_freeable > 0 &&
(possibly_freeable >= REL_TRUNCATE_MINIMUM ||
possibly_freeable >= vacrel->rel_pages / REL_TRUNCATE_FRACTION))
@@ -2666,7 +2678,7 @@ lazy_truncate_heap(LVRelState *vacrel)
/* Update error traceback information one last time */
update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_TRUNCATE,
- vacrel->nonempty_pages, InvalidOffsetNumber);
+ vacrel->scan_state->nonempty_pages, InvalidOffsetNumber);
/*
* Loop until no more truncating can be done.
@@ -2767,7 +2779,7 @@ lazy_truncate_heap(LVRelState *vacrel)
* without also touching reltuples, since the tuple count wasn't
* changed by the truncation.
*/
- vacrel->removed_pages += orig_rel_pages - new_rel_pages;
+ vacrel->scan_state->removed_pages += orig_rel_pages - new_rel_pages;
vacrel->rel_pages = new_rel_pages;
ereport(vacrel->verbose ? INFO : DEBUG2,
@@ -2775,7 +2787,7 @@ lazy_truncate_heap(LVRelState *vacrel)
vacrel->relname,
orig_rel_pages, new_rel_pages)));
orig_rel_pages = new_rel_pages;
- } while (new_rel_pages > vacrel->nonempty_pages && lock_waiter_detected);
+ } while (new_rel_pages > vacrel->scan_state->nonempty_pages && lock_waiter_detected);
}
/*
@@ -2803,7 +2815,7 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
StaticAssertStmt((PREFETCH_SIZE & (PREFETCH_SIZE - 1)) == 0,
"prefetch size must be power of 2");
prefetchedUntil = InvalidBlockNumber;
- while (blkno > vacrel->nonempty_pages)
+ while (blkno > vacrel->scan_state->nonempty_pages)
{
Buffer buf;
Page page;
@@ -2915,7 +2927,7 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
* pages still are; we need not bother to look at the last known-nonempty
* page.
*/
- return vacrel->nonempty_pages;
+ return vacrel->scan_state->nonempty_pages;
}
/*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index fbdb932e6b6..c4e0477c0d4 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1478,6 +1478,7 @@ LPVOID
LPWSTR
LSEG
LUID
+LVRelScanState
LVRelState
LVSavedErrInfo
LWLock
--
2.43.5
Dear Sawada-san,
Thanks for updating the patch. ISTM that 0001 and 0002 can be applied independently.
Therefore I will first post some comments only for them.
Comments for 0001:
```
+ /* New estimated total # of tuples and total # of live tuples */
```
There is an unnecessary blank.
```
+ scan_state = palloc(sizeof(LVRelScanState));
+ scan_state->scanned_pages = 0;
+ scan_state->removed_pages = 0;
+ scan_state->new_frozen_tuple_pages = 0;
+ scan_state->lpdead_item_pages = 0;
+ scan_state->missed_dead_pages = 0;
+ scan_state->nonempty_pages = 0;
+ scan_state->tuples_deleted = 0;
+ scan_state->tuples_frozen = 0;
+ scan_state->lpdead_items = 0;
+ scan_state->live_tuples = 0;
+ scan_state->recently_dead_tuples = 0;
+ scan_state->missed_dead_tuples = 0;
+ scan_state->vm_new_visible_pages = 0;
+ scan_state->vm_new_visible_frozen_pages = 0;
+ scan_state->vm_new_frozen_pages = 0;
+ vacrel->scan_state = scan_state;
```
Since most of attributes are initialized as zero, can you use palloc0() instead?
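For example (this is what the attached diff does), the whole block could become a single zero-initialized allocation:
```
/* All counters start at zero; the XID/MXID fields are assigned later. */
vacrel->scan_state = palloc0(sizeof(LVRelScanState));
```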
```
- * the relation are read. vacrel->skippedallvis is set if we skip a block
- * that's all-visible but not all-frozen, to ensure that we don't update
+ * the relation are read. vacrel->scan_state->skippedallvis is set if we skip
+ * a block that's all-visible but not all-frozen, to ensure that we don't update
* relfrozenxid in that case. vacrel also holds information about the next
```
A line exceeds 80-char limit.
Comments for 0002:
```
+ /* How many time index vacuuming or cleaning up is executed? */
+ int num_index_scans;
+
```
I feel this is a bit confusing because LVRelState also has "num_index_scans".
How about "num_parallel_index_scans"?
The attached patch contains the above changes.
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment: kuroda.diffs
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 61b77af09b..c2fa06b674 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -227,7 +227,7 @@ typedef struct LVRelState
/* Working state for heap scanning and vacuuming */
LVRelScanState *scan_state;
- /* New estimated total # of tuples and total # of live tuples */
+ /* New estimated total # of tuples and total # of live tuples */
double new_rel_tuples;
double new_live_tuples;
/* Statistics output by index AMs */
@@ -321,7 +321,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
BufferAccessStrategy bstrategy)
{
LVRelState *vacrel;
- LVRelScanState *scan_state;
bool verbose,
instrument,
skipwithvm,
@@ -433,23 +432,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
}
/* Initialize page counters explicitly (be tidy) */
- scan_state = palloc(sizeof(LVRelScanState));
- scan_state->scanned_pages = 0;
- scan_state->removed_pages = 0;
- scan_state->new_frozen_tuple_pages = 0;
- scan_state->lpdead_item_pages = 0;
- scan_state->missed_dead_pages = 0;
- scan_state->nonempty_pages = 0;
- scan_state->tuples_deleted = 0;
- scan_state->tuples_frozen = 0;
- scan_state->lpdead_items = 0;
- scan_state->live_tuples = 0;
- scan_state->recently_dead_tuples = 0;
- scan_state->missed_dead_tuples = 0;
- scan_state->vm_new_visible_pages = 0;
- scan_state->vm_new_visible_frozen_pages = 0;
- scan_state->vm_new_frozen_pages = 0;
- vacrel->scan_state = scan_state;
+ vacrel->scan_state = palloc0(sizeof(LVRelScanState));
/* dead_items_alloc allocates vacrel->dead_items later on */
/* Allocate/initialize output statistics state */
@@ -1122,9 +1105,9 @@ lazy_scan_heap(LVRelState *vacrel)
*
* vacrel is an in/out parameter here. Vacuum options and information about
* the relation are read. vacrel->scan_state->skippedallvis is set if we skip
- * a block that's all-visible but not all-frozen, to ensure that we don't update
- * relfrozenxid in that case. vacrel also holds information about the next
- * unskippable block, as bookkeeping for this function.
+ * a block that's all-visible but not all-frozen, to ensure that we don't
+ * update relfrozenxid in that case. vacrel also holds information about the
+ * next unskippable block, as bookkeeping for this function.
*/
static bool
heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 50dd3d7d14..11282e98a1 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -201,7 +201,7 @@ struct ParallelVacuumState
bool *will_parallel_vacuum;
/* How many time index vacuuming or cleaning up is executed? */
- int num_index_scans;
+ int num_parallel_index_scans;
/*
* The number of indexes that support parallel index bulk-deletion and
@@ -231,7 +231,7 @@ static void parallel_vacuum_process_safe_indexes(ParallelVacuumState *pvs);
static void parallel_vacuum_process_unsafe_indexes(ParallelVacuumState *pvs);
static void parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
PVIndStats *indstats);
-static bool parallel_vacuum_index_is_parallel_safe(Relation indrel, int num_index_scans,
+static bool parallel_vacuum_index_is_parallel_safe(Relation indrel, int num_parallel_index_scans,
bool vacuum);
static void parallel_vacuum_error_callback(void *arg);
@@ -631,7 +631,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, bool vacuum)
nworkers = pvs->nindexes_parallel_cleanup;
/* Add conditionally parallel-aware indexes if in the first time call */
- if (pvs->num_index_scans == 0)
+ if (pvs->num_parallel_index_scans == 0)
nworkers += pvs->nindexes_parallel_condcleanup;
}
@@ -659,7 +659,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, bool vacuum)
indstats->parallel_workers_can_process =
(pvs->will_parallel_vacuum[i] &&
parallel_vacuum_index_is_parallel_safe(pvs->indrels[i],
- pvs->num_index_scans,
+ pvs->num_parallel_index_scans,
vacuum));
}
@@ -670,7 +670,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, bool vacuum)
if (nworkers > 0)
{
/* Reinitialize parallel context to relaunch parallel workers */
- if (pvs->num_index_scans > 0)
+ if (pvs->num_parallel_index_scans > 0)
ReinitializeParallelDSM(pvs->pcxt);
/*
@@ -766,7 +766,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, bool vacuum)
}
/* Increment the counter */
- pvs->num_index_scans++;
+ pvs->num_parallel_index_scans++;
}
/*
@@ -951,7 +951,8 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
* parallel index vacuum or parallel index cleanup.
*/
static bool
-parallel_vacuum_index_is_parallel_safe(Relation indrel, int num_index_scans,
+parallel_vacuum_index_is_parallel_safe(Relation indrel,
+ int num_parallel_index_scans,
bool vacuum)
{
uint8 vacoptions;
@@ -975,7 +976,7 @@ parallel_vacuum_index_is_parallel_safe(Relation indrel, int num_index_scans,
* VACUUM_OPTION_PARALLEL_COND_CLEANUP to know when indexes support
* parallel cleanup conditionally.
*/
- if (num_index_scans > 0 &&
+ if (num_parallel_index_scans > 0 &&
((vacoptions & VACUUM_OPTION_PARALLEL_COND_CLEANUP) != 0))
return false;
On 12/19/24 23:05, Masahiko Sawada wrote:
On Sat, Dec 14, 2024 at 1:24 PM Tomas Vondra <tomas@vondra.me> wrote:
On 12/13/24 00:04, Tomas Vondra wrote:
...
The main difference is here:
master / no parallel workers:
pages: 0 removed, 221239 remain, 221239 scanned (100.00% of total)
1 parallel worker:
pages: 0 removed, 221239 remain, 10001 scanned (4.52% of total)
Clearly, with parallel vacuum we scan only a tiny fraction of the pages,
essentially just those with deleted tuples, which is ~1/20 of pages.
That's close to the 15x speedup.

This effect is clearest without indexes, but it does affect even runs
with indexes - having to scan the indexes makes it much less pronounced,
though. However, these indexes are pretty massive (about the same size
as the table) - multiple times larger than the table. Chances are it'd
be clearer on realistic data sets.

So the question is - is this correct? And if yes, why doesn't the
regular (serial) vacuum do that?

There are some more strange things, though. For example, how come the avg
read rate is 0.000 MB/s?

avg read rate: 0.000 MB/s, avg write rate: 525.533 MB/s
It scanned 10k pages, i.e. ~80MB of data in 0.15 seconds. Surely that's
not 0.000 MB/s? I guess it's calculated from buffer misses, and all the
pages are in shared buffers (thanks to the DELETE earlier in that session).

OK, after looking into this a bit more I think the reason is rather
simple - SKIP_PAGES_THRESHOLD.

With serial runs, we end up scanning all pages, because even with an
update every 5000 tuples, that's still only ~25 pages apart, well within
the 32-page window. So we end up skipping no pages, scan and vacuum all
everything.

But parallel runs have this skipping logic disabled, or rather the logic
that switches to sequential scans if the gap is less than 32 pages.

IMHO this raises two questions:

1) Shouldn't parallel runs use SKIP_PAGES_THRESHOLD too, i.e. switch to
sequential scans if the pages are close enough. Maybe there is a reason
for this difference? Workers can reduce the difference between random
and sequential I/O, similarly to prefetching. But that just means the
workers should use a lower threshold, e.g. as

SKIP_PAGES_THRESHOLD / nworkers

or something like that? I don't see this discussed in this thread.
Each parallel heap scan worker allocates a chunk of blocks which is
8192 blocks at maximum, so we would need to use the
SKIP_PAGE_THRESHOLD optimization within the chunk. I agree that we
need to evaluate the differences anyway. Will do the benchmark test
and share the results.
Right. I don't think this really matters for small tables, and for large
tables the chunks should be fairly large (possibly up to 8192 blocks),
in which case we could apply SKIP_PAGE_THRESHOLD just like in the serial
case. There might be differences at boundaries between chunks, but that
seems like a minor / expected detail. I haven't checked if the code
would need to change / how much.
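To illustrate the idea, a rough sketch only - next_unskippable_in_chunk() and the chunk bounds are made-up names, not anything from the patch:
```
/*
 * Rough sketch: apply SKIP_PAGES_THRESHOLD within the chunk of blocks
 * assigned to one worker.  next_unskippable_in_chunk() is a hypothetical
 * helper returning the next non-all-visible block up to chunk_end.
 */
static BlockNumber
next_block_to_scan(BlockNumber blkno, BlockNumber chunk_end)
{
    BlockNumber next_unskippable = next_unskippable_in_chunk(blkno, chunk_end);

    /*
     * If the run of skippable (all-visible) pages is shorter than the
     * threshold, keep reading sequentially, as the serial code does.
     */
    if (next_unskippable - blkno < SKIP_PAGES_THRESHOLD)
        return blkno + 1;

    /* Otherwise jump ahead, but never beyond this worker's chunk. */
    return Min(next_unskippable, chunk_end);
}
```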
2) It seems the current SKIP_PAGES_THRESHOLD is awfully high for good
storage. If I can get an order of magnitude improvement (or more than
that) by disabling the threshold, and just doing random I/O, maybe
there's time to adjust it a bit.

Yeah, you've started a thread for this so let's discuss it there.
OK. FWIW as suggested in the other thread, it doesn't seem to be merely
a question of VACUUM performance, as not skipping pages gives vacuum the
opportunity to do cleanup that would otherwise need to happen later.
If only for this reason, I think it would be good to keep the serial and
parallel vacuum consistent.
regards
--
Tomas Vondra
On Wed, Dec 25, 2024 at 8:52 AM Tomas Vondra <tomas@vondra.me> wrote:
I've not evaluated the SKIP_PAGES_THRESHOLD optimization yet, but I'd like
to share the latest patch set, as cfbot reports some failures. Comments
from Kuroda-san are also incorporated in this version. I'd also like to
share the performance test results I did with the latest patch.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v6-0004-raidxtree.h-support-shared-iteration.patch
From 15c1688c537764c2ef859ccfc9dd506c12eb970a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 24 Oct 2024 17:29:51 -0700
Subject: [PATCH v6 4/8] radixtree.h: support shared iteration.
This commit supports a shared iteration operation on a radix tree with
multiple processes. The radix tree must be in shared mode to start a
shared iteration. Parallel workers can attach to the shared iteration
using the iterator handle given by the leader process. As with normal
iteration, the shared iteration is guaranteed to return key-values in
ascending order.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
---
src/include/lib/radixtree.h | 227 +++++++++++++++---
.../modules/test_radixtree/test_radixtree.c | 128 ++++++----
2 files changed, 281 insertions(+), 74 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 6432b51a246..bfe4c927fa8 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -136,6 +136,9 @@
* RT_LOCK_SHARE - Lock the radix tree in share mode
* RT_UNLOCK - Unlock the radix tree
* RT_GET_HANDLE - Return the handle of the radix tree
+ * RT_BEGIN_ITERATE_SHARED - Begin iterating in shared mode.
+ * RT_ATTACH_ITERATE_SHARED - Attach to the shared iterator.
+ * RT_GET_ITER_HANDLE - Get the handle of the shared iterator.
*
* Optional Interface
* ---------
@@ -179,6 +182,9 @@
#define RT_ATTACH RT_MAKE_NAME(attach)
#define RT_DETACH RT_MAKE_NAME(detach)
#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#define RT_BEGIN_ITERATE_SHARED RT_MAKE_NAME(begin_iterate_shared)
+#define RT_ATTACH_ITERATE_SHARED RT_MAKE_NAME(attach_iterate_shared)
+#define RT_GET_ITER_HANDLE RT_MAKE_NAME(get_iter_handle)
#define RT_LOCK_EXCLUSIVE RT_MAKE_NAME(lock_exclusive)
#define RT_LOCK_SHARE RT_MAKE_NAME(lock_share)
#define RT_UNLOCK RT_MAKE_NAME(unlock)
@@ -238,15 +244,19 @@
#define RT_SHRINK_NODE_16 RT_MAKE_NAME(shrink_child_16)
#define RT_SHRINK_NODE_48 RT_MAKE_NAME(shrink_child_48)
#define RT_SHRINK_NODE_256 RT_MAKE_NAME(shrink_child_256)
+#define RT_INITIALIZE_ITER RT_MAKE_NAME(initialize_iter)
#define RT_NODE_ITERATE_NEXT RT_MAKE_NAME(node_iterate_next)
#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
/* type declarations */
#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER_CONTROL RT_MAKE_NAME(iter_control)
#define RT_ITER RT_MAKE_NAME(iter)
#ifdef RT_SHMEM
#define RT_HANDLE RT_MAKE_NAME(handle)
+#define RT_ITER_CONTROL_SHARED RT_MAKE_NAME(iter_control_shared)
+#define RT_ITER_HANDLE RT_MAKE_NAME(iter_handle)
#endif
#define RT_NODE RT_MAKE_NAME(node)
#define RT_CHILD_PTR RT_MAKE_NAME(child_ptr)
@@ -272,6 +282,7 @@ typedef struct RT_ITER RT_ITER;
#ifdef RT_SHMEM
typedef dsa_pointer RT_HANDLE;
+typedef dsa_pointer RT_ITER_HANDLE;
#endif
#ifdef RT_SHMEM
@@ -282,6 +293,9 @@ RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE * tree);
RT_SCOPE void RT_LOCK_EXCLUSIVE(RT_RADIX_TREE * tree);
RT_SCOPE void RT_LOCK_SHARE(RT_RADIX_TREE * tree);
RT_SCOPE void RT_UNLOCK(RT_RADIX_TREE * tree);
+RT_SCOPE RT_ITER *RT_BEGIN_ITERATE_SHARED(RT_RADIX_TREE * tree);
+RT_SCOPE RT_ITER_HANDLE RT_GET_ITER_HANDLE(RT_ITER * iter);
+RT_SCOPE RT_ITER *RT_ATTACH_ITERATE_SHARED(RT_RADIX_TREE * tree, RT_ITER_HANDLE handle);
#else
RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx);
#endif
@@ -689,6 +703,7 @@ typedef struct RT_RADIX_TREE_CONTROL
RT_HANDLE handle;
uint32 magic;
LWLock lock;
+ int tranche_id;
#endif
RT_PTR_ALLOC root;
@@ -742,11 +757,9 @@ typedef struct RT_NODE_ITER
int idx;
} RT_NODE_ITER;
-/* state for iterating over the whole radix tree */
-struct RT_ITER
+/* Contains the iteration state data */
+typedef struct RT_ITER_CONTROL
{
- RT_RADIX_TREE *tree;
-
/*
* A stack to track iteration for each level. Level 0 is the lowest (or
* leaf) level
@@ -757,8 +770,36 @@ struct RT_ITER
/* The key constructed during iteration */
uint64 key;
-};
+} RT_ITER_CONTROL;
+
+#ifdef RT_SHMEM
+/* Contains the shared iteration state data */
+typedef struct RT_ITER_CONTROL_SHARED
+{
+ /* Actual shared iteration state data */
+ RT_ITER_CONTROL common;
+
+ /* protect the control data */
+ LWLock lock;
+
+ RT_ITER_HANDLE handle;
+ pg_atomic_uint32 refcnt;
+} RT_ITER_CONTROL_SHARED;
+#endif
+
+/* state for iterating over the whole radix tree */
+struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+ /* pointing to either local memory or DSA */
+ RT_ITER_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ /* True if the iterator is for shared iteration */
+ bool shared;
+#endif
+};
/* verification (available only in assert-enabled builds) */
static void RT_VERIFY_NODE(RT_NODE * node);
@@ -1850,6 +1891,7 @@ RT_CREATE(MemoryContext ctx)
tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
tree->ctl->handle = dp;
tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+ tree->ctl->tranche_id = tranche_id;
LWLockInitialize(&tree->ctl->lock, tranche_id);
#else
tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
@@ -1902,6 +1944,9 @@ RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
dsa_pointer control;
tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->iter_context = AllocSetContextCreate(CurrentMemoryContext,
+ RT_STR(RT_PREFIX) "_radix_tree iter context",
+ ALLOCSET_SMALL_SIZES);
/* Find the control object in shared memory */
control = handle;
@@ -2074,35 +2119,86 @@ RT_FREE(RT_RADIX_TREE * tree)
/***************** ITERATION *****************/
+/* Common routine to initialize the given iterator */
+static void
+RT_INITIALIZE_ITER(RT_RADIX_TREE * tree, RT_ITER * iter)
+{
+ RT_CHILD_PTR root;
+
+ iter->tree = tree;
+
+ Assert(RT_PTR_ALLOC_IS_VALID(tree->ctl->root));
+ root.alloc = iter->tree->ctl->root;
+ RT_PTR_SET_LOCAL(tree, &root);
+
+ iter->ctl->top_level = iter->tree->ctl->start_shift / RT_SPAN;
+
+ /* Set the root to start */
+ iter->ctl->cur_level = iter->ctl->top_level;
+ iter->ctl->node_iters[iter->ctl->cur_level].node = root;
+ iter->ctl->node_iters[iter->ctl->cur_level].idx = 0;
+}
+
/*
* Create and return the iterator for the given radix tree.
*
- * Taking a lock in shared mode during the iteration is the caller's
- * responsibility.
+ * Taking a lock on a radix tree in shared mode during the iteration is the
+ * caller's responsibility.
*/
RT_SCOPE RT_ITER *
RT_BEGIN_ITERATE(RT_RADIX_TREE * tree)
{
RT_ITER *iter;
- RT_CHILD_PTR root;
iter = (RT_ITER *) MemoryContextAllocZero(tree->iter_context,
sizeof(RT_ITER));
- iter->tree = tree;
+ iter->ctl = (RT_ITER_CONTROL *) MemoryContextAllocZero(tree->iter_context,
+ sizeof(RT_ITER_CONTROL));
- Assert(RT_PTR_ALLOC_IS_VALID(tree->ctl->root));
- root.alloc = iter->tree->ctl->root;
- RT_PTR_SET_LOCAL(tree, &root);
+ RT_INITIALIZE_ITER(tree, iter);
- iter->top_level = iter->tree->ctl->start_shift / RT_SPAN;
+#ifdef RT_SHMEM
+ /* we will do non-shared iteration on a shared radix tree */
+ iter->shared = false;
+#endif
- /* Set the root to start */
- iter->cur_level = iter->top_level;
- iter->node_iters[iter->cur_level].node = root;
- iter->node_iters[iter->cur_level].idx = 0;
+ return iter;
+}
+
+#ifdef RT_SHMEM
+/*
+ * Create and return a shared iterator for the given shared radix tree.
+ *
+ * It is the caller's responsibility to hold a lock on the radix tree in
+ * shared mode during the shared iteration, to prevent concurrent writes.
+ */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE_SHARED(RT_RADIX_TREE * tree)
+{
+ RT_ITER *iter;
+ RT_ITER_CONTROL_SHARED *ctl_shared;
+ dsa_pointer dp;
+
+ /* The radix tree must be in shared mode */
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ dp = dsa_allocate0(tree->dsa, sizeof(RT_ITER_CONTROL_SHARED));
+ ctl_shared = (RT_ITER_CONTROL_SHARED *) dsa_get_address(tree->dsa, dp);
+ ctl_shared->handle = dp;
+ LWLockInitialize(&ctl_shared->lock, tree->ctl->tranche_id);
+ pg_atomic_init_u32(&ctl_shared->refcnt, 1);
+
+ iter = (RT_ITER *) MemoryContextAllocZero(tree->iter_context,
+ sizeof(RT_ITER));
+
+ iter->ctl = (RT_ITER_CONTROL *) ctl_shared;
+ iter->shared = true;
+
+ RT_INITIALIZE_ITER(tree, iter);
return iter;
}
+#endif
/*
* Scan the inner node and return the next child pointer if one exists, otherwise
@@ -2116,12 +2212,18 @@ RT_NODE_ITERATE_NEXT(RT_ITER * iter, int level)
RT_CHILD_PTR node;
RT_PTR_ALLOC *slot = NULL;
+ node_iter = &(iter->ctl->node_iters[level]);
+ node = node_iter->node;
+
#ifdef RT_SHMEM
- Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
-#endif
- node_iter = &(iter->node_iters[level]);
- node = node_iter->node;
+ /*
+ * Since the iterator is shared, the node's local pointer might have been
+ * set by another backend, so we need to make sure to use our own local
+ * pointer.
+ */
+ if (iter->shared)
+ RT_PTR_SET_LOCAL(iter->tree, &node);
+#endif
Assert(node.local != NULL);
@@ -2194,8 +2296,8 @@ RT_NODE_ITERATE_NEXT(RT_ITER * iter, int level)
}
/* Update the key */
- iter->key &= ~(((uint64) RT_CHUNK_MASK) << (level * RT_SPAN));
- iter->key |= (((uint64) key_chunk) << (level * RT_SPAN));
+ iter->ctl->key &= ~(((uint64) RT_CHUNK_MASK) << (level * RT_SPAN));
+ iter->ctl->key |= (((uint64) key_chunk) << (level * RT_SPAN));
return slot;
}
@@ -2209,18 +2311,29 @@ RT_ITERATE_NEXT(RT_ITER * iter, uint64 *key_p)
{
RT_PTR_ALLOC *slot = NULL;
- while (iter->cur_level <= iter->top_level)
+#ifdef RT_SHMEM
+ /* Prevent the shared iterator from being updated concurrently */
+ if (iter->shared)
+ LWLockAcquire(&((RT_ITER_CONTROL_SHARED *) iter->ctl)->lock, LW_EXCLUSIVE);
+#endif
+
+ while (iter->ctl->cur_level <= iter->ctl->top_level)
{
RT_CHILD_PTR node;
- slot = RT_NODE_ITERATE_NEXT(iter, iter->cur_level);
+ slot = RT_NODE_ITERATE_NEXT(iter, iter->ctl->cur_level);
- if (iter->cur_level == 0 && slot != NULL)
+ if (iter->ctl->cur_level == 0 && slot != NULL)
{
/* Found a value at the leaf node */
- *key_p = iter->key;
+ *key_p = iter->ctl->key;
node.alloc = *slot;
+#ifdef RT_SHMEM
+ if (iter->shared)
+ LWLockRelease(&((RT_ITER_CONTROL_SHARED *) iter->ctl)->lock);
+#endif
+
if (RT_CHILDPTR_IS_VALUE(*slot))
return (RT_VALUE_TYPE *) slot;
else
@@ -2236,17 +2349,23 @@ RT_ITERATE_NEXT(RT_ITER * iter, uint64 *key_p)
node.alloc = *slot;
RT_PTR_SET_LOCAL(iter->tree, &node);
- iter->cur_level--;
- iter->node_iters[iter->cur_level].node = node;
- iter->node_iters[iter->cur_level].idx = 0;
+ iter->ctl->cur_level--;
+ iter->ctl->node_iters[iter->ctl->cur_level].node = node;
+ iter->ctl->node_iters[iter->ctl->cur_level].idx = 0;
}
else
{
/* Not found the child slot, move up the tree */
- iter->cur_level++;
+ iter->ctl->cur_level++;
}
+
}
+#ifdef RT_SHMEM
+ if (iter->shared)
+ LWLockRelease(&((RT_ITER_CONTROL_SHARED *) iter->ctl)->lock);
+#endif
+
/* We've visited all nodes, so the iteration finished */
return NULL;
}
@@ -2257,9 +2376,45 @@ RT_ITERATE_NEXT(RT_ITER * iter, uint64 *key_p)
RT_SCOPE void
RT_END_ITERATE(RT_ITER * iter)
{
+#ifdef RT_SHMEM
+ RT_ITER_CONTROL_SHARED *ctl = (RT_ITER_CONTROL_SHARED *) iter->ctl;
+
+ if (iter->shared &&
+ pg_atomic_sub_fetch_u32(&ctl->refcnt, 1) == 0)
+ dsa_free(iter->tree->dsa, ctl->handle);
+#endif
pfree(iter);
}
+#ifdef RT_SHMEM
+RT_SCOPE RT_ITER_HANDLE
+RT_GET_ITER_HANDLE(RT_ITER * iter)
+{
+ Assert(iter->shared);
+ return ((RT_ITER_CONTROL_SHARED *) iter->ctl)->handle;
+
+}
+
+RT_SCOPE RT_ITER *
+RT_ATTACH_ITERATE_SHARED(RT_RADIX_TREE * tree, RT_ITER_HANDLE handle)
+{
+ RT_ITER *iter;
+ RT_ITER_CONTROL_SHARED *ctl;
+
+ iter = (RT_ITER *) MemoryContextAllocZero(tree->iter_context,
+ sizeof(RT_ITER));
+ iter->tree = tree;
+ ctl = (RT_ITER_CONTROL_SHARED *) dsa_get_address(tree->dsa, handle);
+ iter->ctl = (RT_ITER_CONTROL *) ctl;
+ iter->shared = true;
+
+ /* For every iterator, increase the refcnt by 1 */
+ pg_atomic_add_fetch_u32(&ctl->refcnt, 1);
+
+ return iter;
+}
+#endif
+
/***************** DELETION *****************/
#ifdef RT_USE_DELETE
@@ -2959,7 +3114,11 @@ RT_DUMP_NODE(RT_NODE * node)
#undef RT_PTR_ALLOC
#undef RT_INVALID_PTR_ALLOC
#undef RT_HANDLE
+#undef RT_ITER_HANDLE
+#undef RT_ITER_CONTROL
+#undef RT_ITER_CONTROL_SHARED
#undef RT_ITER
#undef RT_NODE
#undef RT_NODE_ITER
#undef RT_NODE_KIND_4
@@ -2996,6 +3155,11 @@ RT_DUMP_NODE(RT_NODE * node)
#undef RT_LOCK_SHARE
#undef RT_UNLOCK
#undef RT_GET_HANDLE
+#undef RT_BEGIN_ITERATE_SHARED
+#undef RT_ATTACH_ITERATE_SHARED
+#undef RT_GET_ITER_HANDLE
#undef RT_FIND
#undef RT_SET
#undef RT_BEGIN_ITERATE
@@ -3052,5 +3216,6 @@ RT_DUMP_NODE(RT_NODE * node)
#undef RT_SHRINK_NODE_256
#undef RT_NODE_DELETE
#undef RT_NODE_INSERT
+#undef RT_INITIALIZE_ITER
#undef RT_NODE_ITERATE_NEXT
#undef RT_VERIFY_NODE
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 8b379567970..3043d0af6a4 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -161,13 +161,87 @@ test_empty(void)
#endif
}
+/* Iteration test for test_basic() */
+static void
+test_iterate_basic(rt_radix_tree *radixtree, uint64 *keys, int children,
+ bool asc, bool shared)
+{
+ rt_iter *iter;
+
+#ifdef TEST_SHARED_RT
+ if (!shared)
+ iter = rt_begin_iterate(radixtree);
+ else
+ iter = rt_begin_iterate_shared(radixtree);
+#else
+ iter = rt_begin_iterate(radixtree);
+#endif
+
+ for (int i = 0; i < children; i++)
+ {
+ uint64 expected;
+ uint64 iterkey;
+ TestValueType *iterval;
+
+ /* iteration is ordered by key, so adjust expected value accordingly */
+ if (asc)
+ expected = keys[i];
+ else
+ expected = keys[children - 1 - i];
+
+ iterval = rt_iterate_next(iter, &iterkey);
+
+ EXPECT_TRUE(iterval != NULL);
+ EXPECT_EQ_U64(iterkey, expected);
+ EXPECT_EQ_U64(*iterval, expected);
+ }
+
+ rt_end_iterate(iter);
+}
+
+/* Iteration test for test_random() */
+static void
+test_iterate_random(rt_radix_tree *radixtree, uint64 *keys, int num_keys,
+ bool shared)
+{
+ rt_iter *iter;
+
+#ifdef TEST_SHARED_RT
+ if (!shared)
+ iter = rt_begin_iterate(radixtree);
+ else
+ iter = rt_begin_iterate_shared(radixtree);
+#else
+ iter = rt_begin_iterate(radixtree);
+#endif
+
+ for (int i = 0; i < num_keys; i++)
+ {
+ uint64 expected;
+ uint64 iterkey;
+ TestValueType *iterval;
+
+ /* skip duplicate keys */
+ if (i < num_keys - 1 && keys[i + 1] == keys[i])
+ continue;
+
+ expected = keys[i];
+ iterval = rt_iterate_next(iter, &iterkey);
+
+ EXPECT_TRUE(iterval != NULL);
+ EXPECT_EQ_U64(iterkey, expected);
+ EXPECT_EQ_U64(*iterval, expected);
+ }
+
+ rt_end_iterate(iter);
+}
+
/* Basic set, find, and delete tests */
static void
test_basic(rt_node_class_test_elem *test_info, int shift, bool asc)
{
MemoryContext radixtree_ctx;
rt_radix_tree *radixtree;
- rt_iter *iter;
uint64 *keys;
int children = test_info->nkeys;
#ifdef TEST_SHARED_RT
@@ -250,28 +324,12 @@ test_basic(rt_node_class_test_elem *test_info, int shift, bool asc)
}
/* test that iteration returns the expected keys and values */
- iter = rt_begin_iterate(radixtree);
-
- for (int i = 0; i < children; i++)
- {
- uint64 expected;
- uint64 iterkey;
- TestValueType *iterval;
-
- /* iteration is ordered by key, so adjust expected value accordingly */
- if (asc)
- expected = keys[i];
- else
- expected = keys[children - 1 - i];
-
- iterval = rt_iterate_next(iter, &iterkey);
-
- EXPECT_TRUE(iterval != NULL);
- EXPECT_EQ_U64(iterkey, expected);
- EXPECT_EQ_U64(*iterval, expected);
- }
+ test_iterate_basic(radixtree, keys, children, asc, false);
- rt_end_iterate(iter);
+#ifdef TEST_SHARED_RT
+ /* test shared-iteration as well */
+ test_iterate_basic(radixtree, keys, children, asc, true);
+#endif
/* delete all keys again */
for (int i = 0; i < children; i++)
@@ -302,7 +360,6 @@ test_random(void)
{
MemoryContext radixtree_ctx;
rt_radix_tree *radixtree;
- rt_iter *iter;
pg_prng_state state;
/* limit memory usage by limiting the key space */
@@ -395,27 +452,12 @@ test_random(void)
}
/* test that iteration returns the expected keys and values */
- iter = rt_begin_iterate(radixtree);
-
- for (int i = 0; i < num_keys; i++)
- {
- uint64 expected;
- uint64 iterkey;
- TestValueType *iterval;
+ test_iterate_random(radixtree, keys, num_keys, false);
- /* skip duplicate keys */
- if (i < num_keys - 1 && keys[i + 1] == keys[i])
- continue;
-
- expected = keys[i];
- iterval = rt_iterate_next(iter, &iterkey);
-
- EXPECT_TRUE(iterval != NULL);
- EXPECT_EQ_U64(iterkey, expected);
- EXPECT_EQ_U64(*iterval, expected);
- }
-
- rt_end_iterate(iter);
+#ifdef TEST_SHARED_RT
+ /* test shared-iteration as well */
+ test_iterate_random(radixtree, keys, num_keys, true);
+#endif
/* reset random number generator for deletion */
pg_prng_seed(&state, seed);
--
2.43.5
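For reference, the intended call sequence of the new shared-iteration API
looks roughly like this (a sketch using the rt_ prefix from
test_radixtree.c; the mechanism that transports the handle to workers,
e.g. a DSM segment, and the share-lock protocol around the iteration are
elided, and do_something() is a hypothetical stand-in for per-key work):

/* Leader: start a shared iteration and publish its handle. */
rt_iter    *iter = rt_begin_iterate_shared(radixtree);
dsa_pointer handle = rt_get_iter_handle(iter);
/* ... pass 'handle' to the workers ... */

/* Worker: attach to the same iteration state. */
rt_iter    *witer = rt_attach_iterate_shared(radixtree, handle);
uint64      key;
TestValueType *val;

/*
 * Each rt_iterate_next() call hands out the globally next key, so all
 * participants collectively consume the key space in ascending order.
 */
while ((val = rt_iterate_next(witer, &key)) != NULL)
    do_something(key, *val);    /* hypothetical per-key work */

/*
 * The shared control struct is refcounted; the last rt_end_iterate()
 * frees it from the DSA area.
 */
rt_end_iterate(witer);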
v6-0006-radixtree.h-Add-RT_NUM_KEY-API-to-get-the-number-.patch
From 3a6062c76a69ebc34117e5f4277ba2e7d2269321 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 13 Dec 2024 16:54:46 -0800
Subject: [PATCH v6 6/8] radixtree.h: Add RT_NUM_KEY API to get the number of
keys.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
src/include/lib/radixtree.h | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index bfe4c927fa8..12d8217762e 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -126,6 +126,7 @@
* RT_ITERATE_NEXT - Return next key-value pair, if any
* RT_END_ITERATE - End iteration
* RT_MEMORY_USAGE - Get the memory as measured by space in memory context blocks
+ * RT_NUM_KEYS - Get the number of key-value pairs in radix tree
*
* Interface for Shared Memory
* ---------
@@ -197,6 +198,7 @@
#define RT_DELETE RT_MAKE_NAME(delete)
#endif
#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#define RT_NUM_KEYS RT_MAKE_NAME(num_keys)
#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
#define RT_STATS RT_MAKE_NAME(stats)
@@ -313,6 +315,7 @@ RT_SCOPE RT_VALUE_TYPE *RT_ITERATE_NEXT(RT_ITER * iter, uint64 *key_p);
RT_SCOPE void RT_END_ITERATE(RT_ITER * iter);
RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE * tree);
+RT_SCOPE int64 RT_NUM_KEYS(RT_RADIX_TREE * tree);
#ifdef RT_DEBUG
RT_SCOPE void RT_STATS(RT_RADIX_TREE * tree);
@@ -2844,6 +2847,15 @@ RT_MEMORY_USAGE(RT_RADIX_TREE * tree)
return total;
}
+RT_SCOPE int64
+RT_NUM_KEYS(RT_RADIX_TREE * tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ return tree->ctl->num_keys;
+}
+
/*
* Perform some sanity checks on the given node.
*/
@@ -3167,6 +3179,7 @@ RT_DUMP_NODE(RT_NODE * node)
#undef RT_END_ITERATE
#undef RT_DELETE
#undef RT_MEMORY_USAGE
+#undef RT_NUM_KEYS
#undef RT_DUMP_NODE
#undef RT_STATS
--
2.43.5
v6-0005-Support-shared-itereation-on-TidStore.patch
From d9df05156392d9df46d177c2ffaa9bf70974c187 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 24 Oct 2024 17:34:57 -0700
Subject: [PATCH v6 5/8] Support shared iteration on TidStore.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
src/backend/access/common/tidstore.c | 59 ++++++++++++++++++
src/include/access/tidstore.h | 3 +
.../modules/test_tidstore/test_tidstore.c | 62 ++++++++++++++-----
3 files changed, 110 insertions(+), 14 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 27f20cf1972..399adf4af31 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -483,6 +483,7 @@ TidStoreBeginIterate(TidStore *ts)
iter = palloc0(sizeof(TidStoreIter));
iter->ts = ts;
+ /* begin iteration on the radix tree */
if (TidStoreIsShared(ts))
iter->tree_iter.shared = shared_ts_begin_iterate(ts->tree.shared);
else
@@ -533,6 +534,56 @@ TidStoreEndIterate(TidStoreIter *iter)
pfree(iter);
}
+/*
+ * Prepare to iterate through a shared TidStore in shared mode. This function
+ * starts an iteration on the given TidStore that parallel workers can join.
+ *
+ * The TidStoreIter struct is created in the caller's memory context, and it
+ * will be freed in TidStoreEndIterate.
+ *
+ * The caller is responsible for locking TidStore until the iteration is
+ * finished.
+ */
+TidStoreIter *
+TidStoreBeginIterateShared(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ if (!TidStoreIsShared(ts))
+ elog(ERROR, "cannot begin shared iteration on local TidStore");
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+
+ /* begin the shared iteration on radix tree */
+ iter->tree_iter.shared =
+ (shared_ts_iter *) shared_ts_begin_iterate_shared(ts->tree.shared);
+
+ return iter;
+}
+
+/*
+ * Attach to the shared TidStore iterator. 'iter_handle' is the dsa_pointer
+ * returned by TidStoreGetSharedIterHandle(). The returned object is allocated
+ * in backend-local memory using CurrentMemoryContext.
+ */
+TidStoreIter *
+TidStoreAttachIterateShared(TidStore *ts, dsa_pointer iter_handle)
+{
+ TidStoreIter *iter;
+
+ Assert(TidStoreIsShared(ts));
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+
+ /* Attach to the shared iterator */
+ iter->tree_iter.shared = shared_ts_attach_iterate_shared(ts->tree.shared,
+ iter_handle);
+
+ return iter;
+}
+
/*
* Return the memory usage of TidStore.
*/
@@ -564,6 +615,14 @@ TidStoreGetHandle(TidStore *ts)
return (dsa_pointer) shared_ts_get_handle(ts->tree.shared);
}
+dsa_pointer
+TidStoreGetSharedIterHandle(TidStoreIter *iter)
+{
+ Assert(TidStoreIsShared(iter->ts));
+
+ return (dsa_pointer) shared_ts_get_iter_handle(iter->tree_iter.shared);
+}
+
/*
* Given a TidStoreIterResult returned by TidStoreIterateNext(), extract the
* offset numbers. Returns the number of offsets filled in, if <=
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
index 041091df278..c886cef0f7d 100644
--- a/src/include/access/tidstore.h
+++ b/src/include/access/tidstore.h
@@ -37,6 +37,9 @@ extern void TidStoreDetach(TidStore *ts);
extern void TidStoreLockExclusive(TidStore *ts);
extern void TidStoreLockShare(TidStore *ts);
extern void TidStoreUnlock(TidStore *ts);
+extern TidStoreIter *TidStoreBeginIterateShared(TidStore *ts);
+extern TidStoreIter *TidStoreAttachIterateShared(TidStore *ts, dsa_pointer iter_handle);
+extern dsa_pointer TidStoreGetSharedIterHandle(TidStoreIter *iter);
extern void TidStoreDestroy(TidStore *ts);
extern void TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
int num_offsets);
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
index eb16e0fbfa6..36654cf0110 100644
--- a/src/test/modules/test_tidstore/test_tidstore.c
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -33,6 +33,7 @@ PG_FUNCTION_INFO_V1(test_is_full);
PG_FUNCTION_INFO_V1(test_destroy);
static TidStore *tidstore = NULL;
+static bool tidstore_is_shared;
static size_t tidstore_empty_size;
/* array for verification of some tests */
@@ -107,6 +108,7 @@ test_create(PG_FUNCTION_ARGS)
LWLockRegisterTranche(tranche_id, "test_tidstore");
tidstore = TidStoreCreateShared(tidstore_max_size, tranche_id);
+ tidstore_is_shared = true;
/*
* Remain attached until end of backend or explicitly detached so that
@@ -115,8 +117,11 @@ test_create(PG_FUNCTION_ARGS)
dsa_pin_mapping(TidStoreGetDSA(tidstore));
}
else
+ {
/* VACUUM uses insert only, so we test the other option. */
tidstore = TidStoreCreateLocal(tidstore_max_size, false);
+ tidstore_is_shared = false;
+ }
tidstore_empty_size = TidStoreMemoryUsage(tidstore);
@@ -212,14 +217,42 @@ do_set_block_offsets(PG_FUNCTION_ARGS)
PG_RETURN_INT64(blkno);
}
+/* Collect TIDs stored in the tidstore, in order */
+static void
+check_iteration(TidStore *tidstore, int *num_iter_tids, bool shared_iter)
+{
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
+
+ TidStoreLockShare(tidstore);
+
+ if (shared_iter)
+ iter = TidStoreBeginIterateShared(tidstore);
+ else
+ iter = TidStoreBeginIterate(tidstore);
+
+ while ((iter_result = TidStoreIterateNext(iter)) != NULL)
+ {
+ OffsetNumber offsets[MaxOffsetNumber];
+ int num_offsets;
+
+ num_offsets = TidStoreGetBlockOffsets(iter_result, offsets, lengthof(offsets));
+ Assert(num_offsets <= lengthof(offsets));
+ for (int i = 0; i < num_offsets; i++)
+ ItemPointerSet(&(items.iter_tids[(*num_iter_tids)++]), iter_result->blkno,
+ offsets[i]);
+ }
+
+ TidStoreEndIterate(iter);
+ TidStoreUnlock(tidstore);
+}
+
/*
* Verify TIDs in store against the array.
*/
Datum
check_set_block_offsets(PG_FUNCTION_ARGS)
{
- TidStoreIter *iter;
- TidStoreIterResult *iter_result;
int num_iter_tids = 0;
int num_lookup_tids = 0;
BlockNumber prevblkno = 0;
@@ -261,22 +294,23 @@ check_set_block_offsets(PG_FUNCTION_ARGS)
}
/* Collect TIDs stored in the tidstore, in order */
+ check_iteration(tidstore, &num_iter_tids, false);
- TidStoreLockShare(tidstore);
- iter = TidStoreBeginIterate(tidstore);
- while ((iter_result = TidStoreIterateNext(iter)) != NULL)
+ /* If the tidstore is shared, check the shared-iteration as well */
+ if (tidstore_is_shared)
{
- OffsetNumber offsets[MaxOffsetNumber];
- int num_offsets;
+ int num_iter_tids_shared = 0;
- num_offsets = TidStoreGetBlockOffsets(iter_result, offsets, lengthof(offsets));
- Assert(num_offsets <= lengthof(offsets));
- for (int i = 0; i < num_offsets; i++)
- ItemPointerSet(&(items.iter_tids[num_iter_tids++]), iter_result->blkno,
- offsets[i]);
+ check_iteration(tidstore, &num_iter_tids_shared, true);
+
+ /*
+ * verify that normal iteration and shared iteration returned the
+ * same number of TIDs.
+ */
+ if (num_lookup_tids != num_iter_tids_shared)
+ elog(ERROR, "shared-iteration should have %d TIDs, have %d",
+ items.num_tids, num_iter_tids_shared);
}
- TidStoreEndIterate(iter);
- TidStoreUnlock(tidstore);
/*
* Sort verification and lookup arrays and test that all arrays are the
--
2.43.5
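Putting this together, leader/worker usage of the new TidStore API would
look roughly like the following sketch (modeled on check_iteration()
above; vacuum_block() is a hypothetical stand-in for the per-block work,
and in the real flow the share lock is held for the whole iteration):

/* Leader: lock the store, start the shared iteration, export the handle. */
TidStoreLockShare(ts);
TidStoreIter *iter = TidStoreBeginIterateShared(ts);
dsa_pointer   handle = TidStoreGetSharedIterHandle(iter);
/* ... hand 'handle' to the parallel workers ... */

/*
 * Worker: attach to the same iteration; each block is returned to
 * exactly one participant.
 */
TidStoreIter *witer = TidStoreAttachIterateShared(ts, handle);
TidStoreIterResult *res;

while ((res = TidStoreIterateNext(witer)) != NULL)
{
    OffsetNumber offsets[MaxOffsetNumber];
    int     num_offsets;

    num_offsets = TidStoreGetBlockOffsets(res, offsets, lengthof(offsets));
    vacuum_block(res->blkno, offsets, num_offsets); /* hypothetical */
}

TidStoreEndIterate(witer);  /* the last detach frees the shared iterator */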
v6-0008-Support-parallel-heap-vacuum-during-lazy-vacuum.patch
From 4cc9be274dd46febf446cfc62b275182860d4226 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 24 Oct 2024 17:37:45 -0700
Subject: [PATCH v6 8/8] Support parallel heap vacuum during lazy vacuum.
This commit further extends parallel vacuum to perform the heap vacuum
phase with parallel workers. It leverages the shared TidStore iteration.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
doc/src/sgml/ref/vacuum.sgml | 17 +-
src/backend/access/heap/vacuumlazy.c | 280 +++++++++++++++++++-------
src/backend/commands/vacuumparallel.c | 10 +-
src/include/commands/vacuum.h | 2 +-
4 files changed, 223 insertions(+), 86 deletions(-)
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index aae0bbcd577..104157b5a56 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -278,20 +278,21 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
<term><literal>PARALLEL</literal></term>
<listitem>
<para>
- Perform scanning heap, index vacuum, and index cleanup phases of
- <command>VACUUM</command> in parallel using
+ Perform scanning heap, vacuuming heap, index vacuum, and index cleanup
+ phases of <command>VACUUM</command> in parallel using
<replaceable class="parameter">integer</replaceable> background workers
(for the details of each vacuum phase, please refer to
<xref linkend="vacuum-phases"/>).
</para>
<para>
For heap tables, the number of workers used to perform the scanning
- heap is determined based on the size of table. A table can participate in
- parallel scanning heap if and only if the size of the table is more than
- <xref linkend="guc-min-parallel-table-scan-size"/>. During scanning heap,
- the heap table's blocks will be divided into ranges and shared among the
- cooperating processes. Each worker process will complete the scanning of
- its given range of blocks before requesting an additional range of blocks.
+ heap and vacuuming heap is determined based on the size of the table. A table
+ can participate in parallel scanning heap if and only if the size of the
+ table is more than <xref linkend="guc-min-parallel-table-scan-size"/>.
+ During scanning heap, the heap table's blocks will be divided into ranges
+ and shared among the cooperating processes. Each worker process will
+ complete the scanning of its given range of blocks before requesting an
+ additional range of blocks.
</para>
<para>
The number of workers used to perform parallel index vacuum and index
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 6502930258a..4841c7715e3 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -20,6 +20,41 @@
* that there only needs to be one call to lazy_vacuum, after the initial pass
* completes.
*
+ * Parallel Vacuum
+ * ----------------
+ * Lazy vacuum on heap tables supports parallel processing for three vacuum
+ * phases: scanning heap, vacuuming indexes, and vacuuming heap. Before the
+ * scanning heap phase, we initialize parallel vacuum state, ParallelVacuumState,
+ * and allocate the TID store in a DSA area if we can use parallel mode for any
+ * of these three phases.
+ *
+ * We may require a different number of parallel vacuum workers for each phase,
+ * depending on factors such as the table size, the number of indexes, and the
+ * number of pages having dead tuples. Parallel workers are launched at the
+ * beginning of each phase and exit at the end of it.
+ *
+ * For scanning the heap table with parallel workers, we utilize the
+ * table_block_parallelscan_xxx facility, which splits the table into several
+ * chunks that parallel workers claim to scan. If the dead_items TID store
+ * comes close to overrunning the available space during the parallel heap
+ * scan, the parallel workers exit and the leader process gathers the scan
+ * results. Then, it performs a round of index and heap vacuuming, which can
+ * also use parallelism. After vacuuming both the indexes and the heap table,
+ * the leader process vacuums the FSM to make newly-freed space visible, and
+ * relaunches parallel workers to resume the heap scan phase. In order to be
+ * able to resume the parallel heap scan from the previous state, the workers'
+ * parallel scan descriptions are stored in shared memory (DSM) and shared
+ * among parallel workers. If the leader launches fewer workers when resuming
+ * the parallel heap scan, some blocks may remain unscanned. The leader
+ * process deals with such blocks serially at the end of the heap scan phase
+ * (see parallel_heap_complete_unfinished_scan()).
+ *
+ * At the beginning of the vacuuming heap phase, the leader launches parallel
+ * workers and initiates the shared iteration on the shared TID store. At the
+ * end of the phase, the leader process waits for all workers to finish and
+ * gathers the workers' results.
+ *
* Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
@@ -172,6 +207,7 @@ typedef struct LVRelScanState
BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
+ BlockNumber vacuumed_pages; /* # pages vacuumed in one second pass */
/* Counters that follow are only for scanned_pages */
int64 tuples_deleted; /* # deleted from table */
@@ -205,11 +241,15 @@ typedef struct PHVShared
* The final value is OR of worker's skippedallvis.
*/
bool skippedallvis;
+ bool do_index_vacuuming;
/* VACUUM operation's cutoffs for freezing and pruning */
struct VacuumCutoffs cutoffs;
GlobalVisState vistest;
+ dsa_pointer shared_iter_handle;
+ bool do_heap_vacuum;
+
/* per-worker scan stats for parallel heap vacuum scan */
LVRelScanState worker_scan_state[FLEXIBLE_ARRAY_MEMBER];
} PHVShared;
@@ -257,6 +297,14 @@ typedef struct PHVState
/* Assigned per-worker scan state */
PHVScanWorkerState *myscanstate;
+ /*
+ * The number of parallel workers to launch for parallel heap scanning.
+ * Note that the number of parallel workers for parallel heap vacuuming
+ * could vary but never exceeds num_heapscan_workers. So this also works as
+ * the maximum number of workers for parallel heap scanning and vacuuming.
+ */
+ int num_heapscan_workers;
+
/*
+ * All blocks up to this value have been scanned, i.e. the minimum of all
* PHVScanWorkerState->last_blkno. This field is updated by
@@ -374,6 +422,7 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
+static void do_lazy_vacuum_heap_rel(LVRelState *vacrel, TidStoreIter *iter);
static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
Buffer buffer, OffsetNumber *deadoffsets,
int num_offsets, Buffer vmbuffer);
@@ -404,6 +453,7 @@ static void do_parallel_lazy_scan_heap(LVRelState *vacrel);
static void parallel_heap_vacuum_compute_min_scanned_blkno(LVRelState *vacrel);
static void parallel_heap_vacuum_gather_scan_results(LVRelState *vacrel);
static void parallel_heap_complete_unfinished_scan(LVRelState *vacrel);
+static int compute_heap_vacuum_parallel_workers(Relation rel, BlockNumber nblocks);
static void vacuum_error_callback(void *arg);
static void update_vacuum_error_info(LVRelState *vacrel,
@@ -551,6 +601,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
scan_state->lpdead_item_pages = 0;
scan_state->missed_dead_pages = 0;
scan_state->nonempty_pages = 0;
+ scan_state->vacuumed_pages = 0;
scan_state->tuples_deleted = 0;
scan_state->tuples_frozen = 0;
scan_state->lpdead_items = 0;
@@ -2456,46 +2507,14 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
return allindexes;
}
-/*
- * lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy
- *
- * This routine marks LP_DEAD items in vacrel->dead_items as LP_UNUSED. Pages
- * that never had lazy_scan_prune record LP_DEAD items are not visited at all.
- *
- * We may also be able to truncate the line pointer array of the heap pages we
- * visit. If there is a contiguous group of LP_UNUSED items at the end of the
- * array, it can be reclaimed as free space. These LP_UNUSED items usually
- * start out as LP_DEAD items recorded by lazy_scan_prune (we set items from
- * each page to LP_UNUSED, and then consider if it's possible to truncate the
- * page's line pointer array).
- *
- * Note: the reason for doing this as a second pass is we cannot remove the
- * tuples until we've removed their index entries, and we want to process
- * index entry removal in batches as large as possible.
- */
static void
-lazy_vacuum_heap_rel(LVRelState *vacrel)
+do_lazy_vacuum_heap_rel(LVRelState *vacrel, TidStoreIter *iter)
{
- BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
- LVSavedErrInfo saved_err_info;
- TidStoreIter *iter;
- TidStoreIterResult *iter_result;
-
- Assert(vacrel->do_index_vacuuming);
- Assert(vacrel->do_index_cleanup);
- Assert(vacrel->num_index_scans > 0);
-
- /* Report that we are now vacuuming the heap */
- pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
- PROGRESS_VACUUM_PHASE_VACUUM_HEAP);
- /* Update error traceback information */
- update_vacuum_error_info(vacrel, &saved_err_info,
- VACUUM_ERRCB_PHASE_VACUUM_HEAP,
- InvalidBlockNumber, InvalidOffsetNumber);
+ TidStoreIterResult *iter_result;
- iter = TidStoreBeginIterate(vacrel->dead_items);
while ((iter_result = TidStoreIterateNext(iter)) != NULL)
{
BlockNumber blkno;
@@ -2533,26 +2552,106 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
UnlockReleaseBuffer(buf);
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
- vacuumed_pages++;
+ vacrel->scan_state->vacuumed_pages++;
}
- TidStoreEndIterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
ReleaseBuffer(vmbuffer);
+}
+
+/*
+ * lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy
+ *
+ * This routine marks LP_DEAD items in vacrel->dead_items as LP_UNUSED. Pages
+ * that never had lazy_scan_prune record LP_DEAD items are not visited at all.
+ *
+ * We may also be able to truncate the line pointer array of the heap pages we
+ * visit. If there is a contiguous group of LP_UNUSED items at the end of the
+ * array, it can be reclaimed as free space. These LP_UNUSED items usually
+ * start out as LP_DEAD items recorded by lazy_scan_prune (we set items from
+ * each page to LP_UNUSED, and then consider if it's possible to truncate the
+ * page's line pointer array).
+ *
+ * Note: the reason for doing this as a second pass is we cannot remove the
+ * tuples until we've removed their index entries, and we want to process
+ * index entry removal in batches as large as possible.
+ */
+static void
+lazy_vacuum_heap_rel(LVRelState *vacrel)
+{
+ LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ int nworkers = 0;
+
+ Assert(vacrel->do_index_vacuuming);
+ Assert(vacrel->do_index_cleanup);
+ Assert(vacrel->num_index_scans > 0);
+
+ /* Report that we are now vacuuming the heap */
+ pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+ PROGRESS_VACUUM_PHASE_VACUUM_HEAP);
+
+ /* Update error traceback information */
+ update_vacuum_error_info(vacrel, &saved_err_info,
+ VACUUM_ERRCB_PHASE_VACUUM_HEAP,
+ InvalidBlockNumber, InvalidOffsetNumber);
+
+ vacrel->scan_state->vacuumed_pages = 0;
+
+ /* Compute parallel workers required to scan blocks to vacuum */
+ if (ParallelHeapVacuumIsActive(vacrel))
+ nworkers = compute_heap_vacuum_parallel_workers(vacrel->rel,
+ TidStoreNumBlocks(vacrel->dead_items));
+
+ if (nworkers > 0)
+ {
+ PHVState *phvstate = vacrel->phvstate;
+
+ iter = TidStoreBeginIterateShared(vacrel->dead_items);
+
+ /* launch workers */
+ phvstate->shared->do_heap_vacuum = true;
+ phvstate->shared->shared_iter_handle = TidStoreGetSharedIterHandle(iter);
+ phvstate->nworkers_launched = parallel_vacuum_table_scan_begin(vacrel->pvs,
+ nworkers);
+ }
+ else
+ iter = TidStoreBeginIterate(vacrel->dead_items);
+
+ /* do the real work */
+ do_lazy_vacuum_heap_rel(vacrel, iter);
+
+ if (ParallelHeapVacuumIsActive(vacrel) && nworkers > 0)
+ {
+ PHVState *phvstate = vacrel->phvstate;
+
+ parallel_vacuum_table_scan_end(vacrel->pvs);
+
+ /* Gather the heap vacuum statistics that workers collected */
+ for (int i = 0; i < phvstate->nworkers_launched; i++)
+ {
+ LVRelScanState *ss = &(phvstate->shared->worker_scan_state[i]);
+
+ vacrel->scan_state->vacuumed_pages += ss->vacuumed_pages;
+ }
+ }
+
+ TidStoreEndIterate(iter);
+
/*
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
Assert(vacrel->num_index_scans > 1 ||
(vacrel->dead_items_info->num_items == vacrel->scan_state->lpdead_items &&
- vacuumed_pages == vacrel->scan_state->lpdead_item_pages));
+ vacrel->scan_state->vacuumed_pages == vacrel->scan_state->lpdead_item_pages));
ereport(DEBUG2,
(errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
vacrel->relname, (long long) vacrel->dead_items_info->num_items,
- vacuumed_pages)));
+ vacrel->scan_state->vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
@@ -3261,6 +3360,11 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
{
vacrel->dead_items = parallel_vacuum_get_dead_items(vacrel->pvs,
&vacrel->dead_items_info);
+
+ if (ParallelHeapVacuumIsActive(vacrel))
+ vacrel->phvstate->num_heapscan_workers =
+ parallel_vacuum_get_nworkers_table(vacrel->pvs);
+
return;
}
}
@@ -3508,37 +3612,41 @@ update_relstats_all_indexes(LVRelState *vacrel)
*
* The calculation logic is borrowed from compute_parallel_worker().
*/
-int
-heap_parallel_vacuum_compute_workers(Relation rel, int nrequested)
+static int
+compute_heap_vacuum_parallel_workers(Relation rel, BlockNumber nblocks)
{
int parallel_workers = 0;
int heap_parallel_threshold;
int heap_pages;
- if (nrequested == 0)
+ /*
+ * Select the number of workers based on the log of the size of the
+ * relation. Note that the upper limit of the min_parallel_table_scan_size
+ * GUC is chosen to prevent overflow here.
+ */
+ heap_parallel_threshold = Max(min_parallel_table_scan_size, 1);
+ heap_pages = BlockNumberIsValid(nblocks) ?
+ nblocks : RelationGetNumberOfBlocks(rel);
+ while (heap_pages >= (BlockNumber) (heap_parallel_threshold * 3))
{
- /*
- * Select the number of workers based on the log of the size of the
- * relation. Note that the upper limit of the
- * min_parallel_table_scan_size GUC is chosen to prevent overflow
- * here.
- */
- heap_parallel_threshold = Max(min_parallel_table_scan_size, 1);
- heap_pages = RelationGetNumberOfBlocks(rel);
- while (heap_pages >= (BlockNumber) (heap_parallel_threshold * 3))
- {
- parallel_workers++;
- heap_parallel_threshold *= 3;
- if (heap_parallel_threshold > INT_MAX / 3)
- break;
- }
+ parallel_workers++;
+ heap_parallel_threshold *= 3;
+ if (heap_parallel_threshold > INT_MAX / 3)
+ break;
}
- else
- parallel_workers = nrequested;
return parallel_workers;
}
+int
+heap_parallel_vacuum_compute_workers(Relation rel, int nrequested)
+{
+ if (nrequested == 0)
+ return compute_heap_vacuum_parallel_workers(rel, InvalidBlockNumber);
+ else
+ return nrequested;
+}
+
/* Estimate shared memory sizes required for parallel heap vacuum */
static inline void
heap_parallel_estimate_shared_memory_size(Relation rel, int nworkers, Size *pscan_len,
@@ -3620,6 +3728,7 @@ heap_parallel_vacuum_initialize(Relation rel, ParallelContext *pcxt,
shared->NewRelfrozenXid = vacrel->scan_state->NewRelfrozenXid;
shared->NewRelminMxid = vacrel->scan_state->NewRelminMxid;
shared->skippedallvis = vacrel->scan_state->skippedallvis;
+ shared->do_index_vacuuming = vacrel->do_index_vacuuming;
/*
* XXX: we copy the contents of vistest to the shared area, but in order
@@ -3672,7 +3781,6 @@ heap_parallel_vacuum_worker(Relation rel, ParallelVacuumState *pvs,
PHVScanWorkerState *scanstate;
LVRelScanState *scan_state;
ErrorContextCallback errcallback;
- bool scan_done;
phvstate = palloc(sizeof(PHVState));
@@ -3709,10 +3817,11 @@ heap_parallel_vacuum_worker(Relation rel, ParallelVacuumState *pvs,
/* initialize per-worker relation statistics */
MemSet(scan_state, 0, sizeof(LVRelScanState));
- /* Set fields necessary for heap scan */
+ /* Set fields necessary for heap scan and vacuum */
vacrel.scan_state->NewRelfrozenXid = shared->NewRelfrozenXid;
vacrel.scan_state->NewRelminMxid = shared->NewRelminMxid;
vacrel.scan_state->skippedallvis = shared->skippedallvis;
+ vacrel.do_index_vacuuming = shared->do_index_vacuuming;
/* Initialize the per-worker scan state if not yet */
if (!phvstate->myscanstate->initialized)
@@ -3734,25 +3843,44 @@ heap_parallel_vacuum_worker(Relation rel, ParallelVacuumState *pvs,
vacrel.relnamespace = get_database_name(RelationGetNamespace(rel));
vacrel.relname = pstrdup(RelationGetRelationName(rel));
vacrel.indname = NULL;
- vacrel.phase = VACUUM_ERRCB_PHASE_SCAN_HEAP;
errcallback.callback = vacuum_error_callback;
errcallback.arg = &vacrel;
errcallback.previous = error_context_stack;
error_context_stack = &errcallback;
- scan_done = do_lazy_scan_heap(&vacrel);
+ if (shared->do_heap_vacuum)
+ {
+ TidStoreIter *iter;
+
+ iter = TidStoreAttachIterateShared(vacrel.dead_items, shared->shared_iter_handle);
+
+ /* Join parallel heap vacuum */
+ vacrel.phase = VACUUM_ERRCB_PHASE_VACUUM_HEAP;
+ do_lazy_vacuum_heap_rel(&vacrel, iter);
+
+ TidStoreEndIterate(iter);
+ }
+ else
+ {
+ bool scan_done;
+
+ /* Join parallel heap scan */
+ vacrel.phase = VACUUM_ERRCB_PHASE_SCAN_HEAP;
+ scan_done = do_lazy_scan_heap(&vacrel);
+
+ /*
+ * If the leader or a worker finishes the heap scan because the dead_items
+ * TID store is close to the limit, it might have some allocated blocks in
+ * its scan state. Since this scan state might not be used in the next
+ * heap scan, we remember that it might have some unconsumed blocks so
+ * that the leader can complete the scans after the heap scan phase
+ * finishes.
+ */
+ phvstate->myscanstate->maybe_have_blocks = !scan_done;
+ }
/* Pop the error context stack */
error_context_stack = errcallback.previous;
-
- /*
- * If the leader or a worker finishes the heap scan because dead_items
- * TIDs is close to the limit, it might have some allocated blocks in its
- * scan state. Since this scan state might not be used in the next heap
- * scan, we remember that it might have some unconsumed blocks so that the
- * leader complete the scans after the heap scan phase finishes.
- */
- phvstate->myscanstate->maybe_have_blocks = !scan_done;
}
/*
@@ -3874,7 +4002,10 @@ do_parallel_lazy_scan_heap(LVRelState *vacrel)
Assert(!IsParallelWorker());
/* launcher workers */
- vacrel->phvstate->nworkers_launched = parallel_vacuum_table_scan_begin(vacrel->pvs);
+ vacrel->phvstate->shared->do_heap_vacuum = false;
+ vacrel->phvstate->nworkers_launched =
+ parallel_vacuum_table_scan_begin(vacrel->pvs,
+ vacrel->phvstate->num_heapscan_workers);
/* initialize parallel scan description to join as a worker */
scanstate = palloc0(sizeof(PHVScanWorkerState));
@@ -3933,7 +4064,8 @@ do_parallel_lazy_scan_heap(LVRelState *vacrel)
/* Re-launch workers to restart parallel heap scan */
vacrel->phvstate->nworkers_launched =
- parallel_vacuum_table_scan_begin(vacrel->pvs);
+ parallel_vacuum_table_scan_begin(vacrel->pvs,
+ vacrel->phvstate->num_heapscan_workers);
}
/*
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 9f8c8f09576..2a096ed4128 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -1054,8 +1054,10 @@ parallel_vacuum_index_is_parallel_safe(Relation indrel, int num_index_scans,
* table vacuum.
*/
int
-parallel_vacuum_table_scan_begin(ParallelVacuumState *pvs)
+parallel_vacuum_table_scan_begin(ParallelVacuumState *pvs, int nworkers_request)
{
+ int nworkers;
+
Assert(!IsParallelWorker());
if (pvs->shared->nworkers_for_table == 0)
@@ -1069,11 +1071,13 @@ parallel_vacuum_table_scan_begin(ParallelVacuumState *pvs)
if (pvs->num_table_scans > 0)
ReinitializeParallelDSM(pvs->pcxt);
+ nworkers = Min(nworkers_request, pvs->shared->nworkers_for_table);
+
/*
* The number of workers might vary between table vacuum and index
* processing
*/
- ReinitializeParallelWorkers(pvs->pcxt, pvs->shared->nworkers_for_table);
+ ReinitializeParallelWorkers(pvs->pcxt, nworkers);
LaunchParallelWorkers(pvs->pcxt);
if (pvs->pcxt->nworkers_launched > 0)
@@ -1097,7 +1101,7 @@ parallel_vacuum_table_scan_begin(ParallelVacuumState *pvs)
(errmsg(ngettext("launched %d parallel vacuum worker for table processing (planned: %d)",
"launched %d parallel vacuum workers for table processing (planned: %d)",
pvs->pcxt->nworkers_launched),
- pvs->pcxt->nworkers_launched, pvs->shared->nworkers_for_table)));
+ pvs->pcxt->nworkers_launched, nworkers)));
return pvs->pcxt->nworkers_launched;
}
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d45866d61e5..7bec04395e9 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -371,7 +371,7 @@ extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
extern void parallel_vacuum_cleanup_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
bool estimated_count);
-extern int parallel_vacuum_table_scan_begin(ParallelVacuumState *pvs);
+extern int parallel_vacuum_table_scan_begin(ParallelVacuumState *pvs, int nworkers_request);
extern void parallel_vacuum_table_scan_end(ParallelVacuumState *pvs);
extern int parallel_vacuum_get_nworkers_table(ParallelVacuumState *pvs);
extern int parallel_vacuum_get_nworkers_index(ParallelVacuumState *pvs);
--
2.43.5
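As a concrete example of the worker computation: assuming the default
min_parallel_table_scan_size of 8MB (1024 blocks),
compute_heap_vacuum_parallel_workers() gives the 221239-page (~1.7GB)
table from the benchmark above 4 workers for the scan phase, since the
tripling loop fires at thresholds of 3072, 9216, 27648 and 82944 pages
and stops at 248832 > 221239. For the vacuuming heap phase, the same
formula is applied to TidStoreNumBlocks(), i.e. to the number of blocks
that actually contain dead items, so a mostly-clean table gets fewer
(possibly zero) workers for the second pass.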
v6-0007-Add-TidStoreNumBlocks-API-to-get-the-number-of-bl.patch
From 6cd17b504ca51178dcb5a7e03d65b5071b9483e6 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 13 Dec 2024 16:55:52 -0800
Subject: [PATCH v6 7/8] Add TidStoreNumBlocks API to get the number of blocks
in TidStore.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
src/backend/access/common/tidstore.c | 12 ++++++++++++
src/include/access/tidstore.h | 1 +
2 files changed, 13 insertions(+)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 399adf4af31..c43b3d8ac69 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -596,6 +596,18 @@ TidStoreMemoryUsage(TidStore *ts)
return local_ts_memory_usage(ts->tree.local);
}
+/*
+ * Return the number of blocks (i.e., keys) stored in the TidStore.
+ */
+BlockNumber
+TidStoreNumBlocks(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ return shared_ts_num_keys(ts->tree.shared);
+ else
+ return local_ts_num_keys(ts->tree.local);
+}
+
/*
* Return the DSA area where the TidStore lives.
*/
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
index c886cef0f7d..fd739d20da1 100644
--- a/src/include/access/tidstore.h
+++ b/src/include/access/tidstore.h
@@ -51,6 +51,7 @@ extern int TidStoreGetBlockOffsets(TidStoreIterResult *result,
int max_offsets);
extern void TidStoreEndIterate(TidStoreIter *iter);
extern size_t TidStoreMemoryUsage(TidStore *ts);
+extern BlockNumber TidStoreNumBlocks(TidStore *ts);
extern dsa_pointer TidStoreGetHandle(TidStore *ts);
extern dsa_area *TidStoreGetDSA(TidStore *ts);
--
2.43.5
v6-0003-Support-parallel-heap-scan-during-lazy-vacuum.patch
From a4ab09e39033423616064c0da4e6a704fdba03a4 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 1 Jul 2024 15:17:46 +0900
Subject: [PATCH v6 3/8] Support parallel heap scan during lazy vacuum.
Commit 40d964ec99 allowed vacuum command to process indexes in
parallel. This change extends the parallel vacuum to support parallel
heap scan during lazy vacuum.
---
doc/src/sgml/ref/vacuum.sgml | 58 +-
src/backend/access/heap/heapam_handler.c | 6 +
src/backend/access/heap/vacuumlazy.c | 929 ++++++++++++++++++++---
src/backend/commands/vacuumparallel.c | 305 ++++++--
src/backend/storage/ipc/procarray.c | 74 --
src/include/access/heapam.h | 8 +
src/include/access/tableam.h | 88 +++
src/include/commands/vacuum.h | 8 +-
src/include/utils/snapmgr.h | 2 +-
src/include/utils/snapmgr_internal.h | 91 +++
src/tools/pgindent/typedefs.list | 3 +
11 files changed, 1332 insertions(+), 240 deletions(-)
create mode 100644 src/include/utils/snapmgr_internal.h
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 9110938fab6..aae0bbcd577 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -277,27 +277,43 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
<varlistentry>
<term><literal>PARALLEL</literal></term>
<listitem>
- <para>
- Perform index vacuum and index cleanup phases of <command>VACUUM</command>
- in parallel using <replaceable class="parameter">integer</replaceable>
- background workers (for the details of each vacuum phase, please
- refer to <xref linkend="vacuum-phases"/>). The number of workers used
- to perform the operation is equal to the number of indexes on the
- relation that support parallel vacuum which is limited by the number of
- workers specified with <literal>PARALLEL</literal> option if any which is
- further limited by <xref linkend="guc-max-parallel-maintenance-workers"/>.
- An index can participate in parallel vacuum if and only if the size of the
- index is more than <xref linkend="guc-min-parallel-index-scan-size"/>.
- Please note that it is not guaranteed that the number of parallel workers
- specified in <replaceable class="parameter">integer</replaceable> will be
- used during execution. It is possible for a vacuum to run with fewer
- workers than specified, or even with no workers at all. Only one worker
- can be used per index. So parallel workers are launched only when there
- are at least <literal>2</literal> indexes in the table. Workers for
- vacuum are launched before the start of each phase and exit at the end of
- the phase. These behaviors might change in a future release. This
- option can't be used with the <literal>FULL</literal> option.
- </para>
+ <para>
+ Perform scanning heap, index vacuum, and index cleanup phases of
+ <command>VACUUM</command> in parallel using
+ <replaceable class="parameter">integer</replaceable> background workers
+ (for the details of each vacuum phase, please refer to
+ <xref linkend="vacuum-phases"/>).
+ </para>
+ <para>
+ For heap tables, the number of workers used to perform the scanning
+ heap is determined based on the size of the table. A table can participate in
+ parallel scanning heap if and only if the size of the table is more than
+ <xref linkend="guc-min-parallel-table-scan-size"/>. During scanning heap,
+ the heap table's blocks will be divided into ranges and shared among the
+ cooperating processes. Each worker process will complete the scanning of
+ its given range of blocks before requesting an additional range of blocks.
+ </para>
+ <para>
+ The number of workers used to perform parallel index vacuum and index
+ cleanup is equal to the number of indexes on the relation that support
+ parallel vacuum. An index can participate in parallel vacuum if and only
+ if the size of the index is more than <xref linkend="guc-min-parallel-index-scan-size"/>.
+ Only one worker can be used per index. So parallel workers for index vacuum
+ and index cleanup are launched only when there are at least <literal>2</literal>
+ indexes in the table.
+ </para>
+ <para>
+     Workers for vacuum are launched before the start of each phase and exit
+     at the end of the phase. The number of workers for each phase is limited by
+     the number of workers specified with the <literal>PARALLEL</literal> option,
+     if any, and is further limited by <xref linkend="guc-max-parallel-maintenance-workers"/>.
+     Please note that in any parallel vacuum phase, it is not guaranteed that the
+     number of parallel workers specified in <replaceable class="parameter">integer</replaceable>
+     will be used during execution. It is possible for a vacuum to run with fewer
+     workers than specified, or even with no workers at all. These behaviors might
+     change in a future release. This option can't be used with the <literal>FULL</literal>
+     option.
+ </para>
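+    <para>
+     For example, <literal>VACUUM (PARALLEL 4) tbl</literal> requests up to
+     four background workers for each parallel phase of vacuuming
+     <literal>tbl</literal>, subject to the limits described above.
+    </para>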
</listitem>
</varlistentry>
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index e817f8f8f84..9484a2fdb3f 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2656,6 +2656,12 @@ static const TableAmRoutine heapam_methods = {
.relation_copy_data = heapam_relation_copy_data,
.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
.relation_vacuum = heap_vacuum_rel,
+
+ .parallel_vacuum_compute_workers = heap_parallel_vacuum_compute_workers,
+ .parallel_vacuum_estimate = heap_parallel_vacuum_estimate,
+ .parallel_vacuum_initialize = heap_parallel_vacuum_initialize,
+ .parallel_vacuum_relation_worker = heap_parallel_vacuum_worker,
+
.scan_analyze_next_block = heapam_scan_analyze_next_block,
.scan_analyze_next_tuple = heapam_scan_analyze_next_tuple,
.index_build_range_scan = heapam_index_build_range_scan,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 116c0612ca5..6502930258a 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -48,6 +48,7 @@
#include "common/int.h"
#include "executor/instrument.h"
#include "miscadmin.h"
+#include "optimizer/paths.h"
#include "pgstat.h"
#include "portability/instr_time.h"
#include "postmaster/autovacuum.h"
@@ -115,10 +116,24 @@
#define PREFETCH_SIZE ((BlockNumber) 32)
/*
- * Macro to check if we are in a parallel vacuum. If true, we are in the
- * parallel mode and the DSM segment is initialized.
+ * DSM keys for parallel heap vacuum scan. Unlike other parallel execution
+ * code, we don't need to worry about DSM keys conflicting with plan_node_id,
+ * but we do need to avoid conflicting with the DSM keys used in
+ * vacuumparallel.c.
+ */
+#define LV_PARALLEL_KEY_SCAN_SHARED 0xFFFF0001
+#define LV_PARALLEL_KEY_SCAN_DESC 0xFFFF0002
+#define LV_PARALLEL_KEY_SCAN_DESC_WORKER 0xFFFF0003
+
+/*
+ * Macros to check if we are in parallel heap vacuuming, parallel index
+ * vacuuming, or both. If ParallelVacuumIsActive() is true, we are in
+ * parallel mode, meaning that the dead item TIDs are stored in the shared
+ * memory area.
*/
#define ParallelVacuumIsActive(vacrel) ((vacrel)->pvs != NULL)
+#define ParallelIndexVacuumIsActive(vacrel) \
+ (ParallelVacuumIsActive(vacrel) && parallel_vacuum_get_nworkers_index((vacrel)->pvs) > 0)
+#define ParallelHeapVacuumIsActive(vacrel) \
+ (ParallelVacuumIsActive(vacrel) && parallel_vacuum_get_nworkers_table((vacrel)->pvs) > 0)
/* Phases of vacuum during which we report error context. */
typedef enum
@@ -172,6 +187,87 @@ typedef struct LVRelScanState
bool skippedallvis;
} LVRelScanState;
+/*
+ * Struct for information that needs to be shared among parallel vacuum workers
+ */
+typedef struct PHVShared
+{
+ bool aggressive;
+ bool skipwithvm;
+
+ /* The current oldest extant XID/MXID shared by the leader process */
+ TransactionId NewRelfrozenXid;
+ MultiXactId NewRelminMxid;
+
+ /*
+ * Have we skipped any all-visible pages?
+ *
+	 * The final value is the OR of all workers' skippedallvis.
+ */
+ bool skippedallvis;
+
+ /* VACUUM operation's cutoffs for freezing and pruning */
+ struct VacuumCutoffs cutoffs;
+ GlobalVisState vistest;
+
+ /* per-worker scan stats for parallel heap vacuum scan */
+ LVRelScanState worker_scan_state[FLEXIBLE_ARRAY_MEMBER];
+} PHVShared;
+#define SizeOfPHVShared (offsetof(PHVShared, worker_scan_state))
+
+/* Per-worker scan state for parallel heap vacuum scan */
+typedef struct PHVScanWorkerState
+{
+ bool initialized;
+
+ /* per-worker parallel table scan state */
+ ParallelBlockTableScanWorkerData state;
+
+ /*
+	 * True if a parallel vacuum scan worker allocated blocks in state but
+	 * might not have scanned all of them. The leader process will take over
+	 * scanning these remaining blocks.
+ */
+ bool maybe_have_blocks;
+
+ /* last block number the worker scanned */
+ BlockNumber last_blkno;
+} PHVScanWorkerState;
+
+/* Struct for parallel heap vacuum */
+typedef struct PHVState
+{
+ /* Parallel scan description shared among parallel workers */
+ ParallelBlockTableScanDesc pscandesc;
+
+ /* Shared information */
+ PHVShared *shared;
+
+ /*
+	 * Points to the array of per-worker scan states stored in the DSM area.
+	 *
+	 * During a parallel heap scan, each worker allocates some chunks of
+	 * blocks to scan in its scan state, and could exit while leaving some
+	 * chunks un-scanned if the size of dead_items TIDs is close to
+	 * overrunning the available space. We store the scan states in the
+	 * shared memory area so that workers can resume heap scans from the
+	 * previous point.
+ */
+ PHVScanWorkerState *scanstates;
+
+ /* Assigned per-worker scan state */
+ PHVScanWorkerState *myscanstate;
+
+ /*
+	 * All blocks up to this value have been scanned, i.e. it is the minimum
+	 * of all PHVScanWorkerState->last_blkno. This field is updated by
+ * parallel_heap_vacuum_compute_min_scanned_blkno().
+ */
+ BlockNumber min_scanned_blkno;
+
+ /* The number of workers launched for parallel heap vacuum */
+ int nworkers_launched;
+} PHVState;
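+
+/*
+ * The shared memory for a parallel heap vacuum thus consists of one PHVShared
+ * (with a trailing LVRelScanState per worker), one parallel block table scan
+ * descriptor, and an array of PHVScanWorkerState with one entry per worker;
+ * see heap_parallel_estimate_shared_memory_size() below for the exact sizing.
+ */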
+
typedef struct LVRelState
{
/* Target heap relation and its indexes */
@@ -183,6 +279,9 @@ typedef struct LVRelState
BufferAccessStrategy bstrategy;
ParallelVacuumState *pvs;
+	/* Parallel heap vacuum state */
+ PHVState *phvstate;
+
/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
bool aggressive;
/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
@@ -223,6 +322,8 @@ typedef struct LVRelState
VacDeadItemsInfo *dead_items_info;
BlockNumber rel_pages; /* total number of pages */
+ BlockNumber next_fsm_block_to_vacuum; /* next block to check for FSM
+ * vacuum */
/* Working state for heap scanning and vacuuming */
LVRelScanState *scan_state;
@@ -254,8 +355,11 @@ typedef struct LVSavedErrInfo
/* non-export function prototypes */
static void lazy_scan_heap(LVRelState *vacrel);
+static bool do_lazy_scan_heap(LVRelState *vacrel);
static bool heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
bool *all_visible_according_to_vm);
+static bool heap_vac_scan_next_block_parallel(LVRelState *vacrel, BlockNumber *blkno,
+ bool *all_visible_according_to_vm);
static void find_next_unskippable_block(LVRelState *vacrel, bool *skipsallvis);
static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
BlockNumber blkno, Page page,
@@ -296,6 +400,11 @@ static void dead_items_cleanup(LVRelState *vacrel);
static bool heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
TransactionId *visibility_cutoff_xid, bool *all_frozen);
static void update_relstats_all_indexes(LVRelState *vacrel);
+static void do_parallel_lazy_scan_heap(LVRelState *vacrel);
+static void parallel_heap_vacuum_compute_min_scanned_blkno(LVRelState *vacrel);
+static void parallel_heap_vacuum_gather_scan_results(LVRelState *vacrel);
+static void parallel_heap_complete_unfinished_scan(LVRelState *vacrel);
+
static void vacuum_error_callback(void *arg);
static void update_vacuum_error_info(LVRelState *vacrel,
LVSavedErrInfo *saved_vacrel,
@@ -432,6 +541,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
Assert(params->index_cleanup == VACOPTVALUE_AUTO);
}
+ vacrel->next_fsm_block_to_vacuum = 0;
+
/* Initialize page counters explicitly (be tidy) */
scan_state = palloc(sizeof(LVRelScanState));
scan_state->scanned_pages = 0;
@@ -452,6 +563,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->scan_state = scan_state;
/* dead_items_alloc allocates vacrel->dead_items later on */
/* Allocate/initialize output statistics state */
vacrel->new_rel_tuples = 0;
vacrel->new_live_tuples = 0;
@@ -861,12 +974,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
static void
lazy_scan_heap(LVRelState *vacrel)
{
- BlockNumber rel_pages = vacrel->rel_pages,
- blkno,
- next_fsm_block_to_vacuum = 0;
- bool all_visible_according_to_vm;
-
- Buffer vmbuffer = InvalidBuffer;
+ BlockNumber rel_pages = vacrel->rel_pages;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -886,12 +994,93 @@ lazy_scan_heap(LVRelState *vacrel)
vacrel->next_unskippable_allvis = false;
vacrel->next_unskippable_vmbuffer = InvalidBuffer;
- while (heap_vac_scan_next_block(vacrel, &blkno, &all_visible_according_to_vm))
+ /*
+ * Do the actual work. If parallel heap vacuum is active, we scan and
+ * vacuum heap using parallel workers.
+ */
+ if (ParallelHeapVacuumIsActive(vacrel))
+ do_parallel_lazy_scan_heap(vacrel);
+ else
+ {
+ bool scan_done PG_USED_FOR_ASSERTS_ONLY;
+
+ scan_done = do_lazy_scan_heap(vacrel);
+
+ /* We must have scanned all heap pages */
+ Assert(scan_done);
+ }
+
+ /* report that everything is now scanned */
+ pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, rel_pages);
+
+ /* now we can compute the new value for pg_class.reltuples */
+ vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
+ vacrel->scan_state->scanned_pages,
+ vacrel->scan_state->live_tuples);
+
+ /*
+ * Also compute the total number of surviving heap entries. In the
+ * (unlikely) scenario that new_live_tuples is -1, take it as zero.
+ */
+ vacrel->new_rel_tuples =
+ Max(vacrel->new_live_tuples, 0) + vacrel->scan_state->recently_dead_tuples +
+ vacrel->scan_state->missed_dead_tuples;
+
+ /*
+ * Do index vacuuming (call each index's ambulkdelete routine), then do
+ * related heap vacuuming
+ */
+ if (vacrel->dead_items_info->num_items > 0)
+ lazy_vacuum(vacrel);
+
+ /*
+ * Vacuum the remainder of the Free Space Map. We must do this whether or
+ * not there were indexes, and whether or not we bypassed index vacuuming.
+ */
+ if (rel_pages > vacrel->next_fsm_block_to_vacuum)
+ FreeSpaceMapVacuumRange(vacrel->rel, vacrel->next_fsm_block_to_vacuum,
+ rel_pages);
+
+ /* report all blocks vacuumed */
+ pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, rel_pages);
+
+ /* Do final index cleanup (call each index's amvacuumcleanup routine) */
+ if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
+ lazy_cleanup_all_indexes(vacrel);
+}
+
+/*
+ * Workhorse for lazy_scan_heap().
+ *
+ * Return true if we processed all blocks; return false if we exited from
+ * this function before completing the heap scan because the space for dead
+ * item TIDs was full. In the serial heap scan case, this function always
+ * returns true. In the parallel heap vacuum case, this function is called by
+ * both worker processes and the leader process, and could return false.
+ */
+static bool
+do_lazy_scan_heap(LVRelState *vacrel)
+{
+ bool all_visible_according_to_vm;
+ BlockNumber blkno;
+ Buffer vmbuffer = InvalidBuffer;
+ bool scan_done = true;
+
+ while (true)
{
Buffer buf;
Page page;
bool has_lpdead_items;
bool got_cleanup_lock = false;
+ bool got_blkno;
+
+ /* Get the next block for vacuum to process */
+ if (ParallelHeapVacuumIsActive(vacrel))
+ got_blkno = heap_vac_scan_next_block_parallel(vacrel, &blkno, &all_visible_according_to_vm);
+ else
+ got_blkno = heap_vac_scan_next_block(vacrel, &blkno, &all_visible_according_to_vm);
+
+ if (!got_blkno)
+ break;
vacrel->scan_state->scanned_pages++;
@@ -911,46 +1100,10 @@ lazy_scan_heap(LVRelState *vacrel)
* one-pass strategy, and the two-pass strategy with the index_cleanup
* param set to 'off'.
*/
- if (vacrel->scan_state->scanned_pages % FAILSAFE_EVERY_PAGES == 0)
+ if (!IsParallelWorker() &&
+ vacrel->scan_state->scanned_pages % FAILSAFE_EVERY_PAGES == 0)
lazy_check_wraparound_failsafe(vacrel);
- /*
- * Consider if we definitely have enough space to process TIDs on page
- * already. If we are close to overrunning the available space for
- * dead_items TIDs, pause and do a cycle of vacuuming before we tackle
- * this page.
- */
- if (TidStoreMemoryUsage(vacrel->dead_items) > vacrel->dead_items_info->max_bytes)
- {
- /*
- * Before beginning index vacuuming, we release any pin we may
- * hold on the visibility map page. This isn't necessary for
- * correctness, but we do it anyway to avoid holding the pin
- * across a lengthy, unrelated operation.
- */
- if (BufferIsValid(vmbuffer))
- {
- ReleaseBuffer(vmbuffer);
- vmbuffer = InvalidBuffer;
- }
-
- /* Perform a round of index and heap vacuuming */
- vacrel->consider_bypass_optimization = false;
- lazy_vacuum(vacrel);
-
- /*
- * Vacuum the Free Space Map to make newly-freed space visible on
- * upper-level FSM pages. Note we have not yet processed blkno.
- */
- FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
- blkno);
- next_fsm_block_to_vacuum = blkno;
-
- /* Report that we are once again scanning the heap */
- pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
- PROGRESS_VACUUM_PHASE_SCAN_HEAP);
- }
-
/*
* Pin the visibility map page in case we need to mark the page
* all-visible. In most cases this will be very cheap, because we'll
@@ -1039,9 +1192,10 @@ lazy_scan_heap(LVRelState *vacrel)
* revisit this page. Since updating the FSM is desirable but not
* absolutely required, that's OK.
*/
- if (vacrel->nindexes == 0
- || !vacrel->do_index_vacuuming
- || !has_lpdead_items)
+ if (!IsParallelWorker() &&
+ (vacrel->nindexes == 0
+ || !vacrel->do_index_vacuuming
+ || !has_lpdead_items))
{
Size freespace = PageGetHeapFreeSpace(page);
@@ -1055,57 +1209,178 @@ lazy_scan_heap(LVRelState *vacrel)
* held the cleanup lock and lazy_scan_prune() was called.
*/
if (got_cleanup_lock && vacrel->nindexes == 0 && has_lpdead_items &&
- blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
+ blkno - vacrel->next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
{
- FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
- blkno);
- next_fsm_block_to_vacuum = blkno;
+ BlockNumber fsm_vac_up_to;
+
+ /*
+ * If parallel heap vacuum scan is active, compute the minimum
+ * block number we scanned so far.
+ */
+ if (ParallelHeapVacuumIsActive(vacrel))
+ {
+ parallel_heap_vacuum_compute_min_scanned_blkno(vacrel);
+ fsm_vac_up_to = vacrel->phvstate->min_scanned_blkno;
+ }
+ else
+ {
+ /* blkno is already processed */
+ fsm_vac_up_to = blkno + 1;
+ }
+
+ FreeSpaceMapVacuumRange(vacrel->rel, vacrel->next_fsm_block_to_vacuum,
+ fsm_vac_up_to);
+ vacrel->next_fsm_block_to_vacuum = fsm_vac_up_to;
}
}
else
UnlockReleaseBuffer(buf);
+
+ /*
+ * Consider if we definitely have enough space to process TIDs on page
+ * already. If we are close to overrunning the available space for
+ * dead_items TIDs, pause and do a cycle of vacuuming before we tackle
+ * this page.
+ */
+ if (TidStoreMemoryUsage(vacrel->dead_items) > vacrel->dead_items_info->max_bytes)
+ {
+ /*
+ * Before beginning index vacuuming, we release any pin we may
+ * hold on the visibility map page. This isn't necessary for
+ * correctness, but we do it anyway to avoid holding the pin
+ * across a lengthy, unrelated operation.
+ */
+ if (BufferIsValid(vmbuffer))
+ {
+ ReleaseBuffer(vmbuffer);
+ vmbuffer = InvalidBuffer;
+ }
+
+ /*
+			 * In a parallel heap scan, we pause the heap scan without
+			 * invoking index and heap vacuuming, and return to the caller
+			 * with scan_done being false. The parallel vacuum workers will
+			 * exit as their jobs are done. The leader process will wait for
+			 * all workers to finish, perform index and heap vacuuming, and
+			 * then perform FSM vacuuming too.
+ */
+ if (ParallelHeapVacuumIsActive(vacrel))
+ {
+ /* Remember the last scanned block */
+ vacrel->phvstate->myscanstate->last_blkno = blkno;
+
+ /* Remember we might have some unprocessed blocks */
+ scan_done = false;
+
+ break;
+ }
+
+ /* Perform a round of index and heap vacuuming */
+ vacrel->consider_bypass_optimization = false;
+ lazy_vacuum(vacrel);
+
+ /*
+ * Vacuum the Free Space Map to make newly-freed space visible on
+ * upper-level FSM pages.
+ */
+ FreeSpaceMapVacuumRange(vacrel->rel, vacrel->next_fsm_block_to_vacuum,
+ blkno + 1);
+			vacrel->next_fsm_block_to_vacuum = blkno + 1;
+
+ /* Report that we are once again scanning the heap */
+ pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+ PROGRESS_VACUUM_PHASE_SCAN_HEAP);
+
+ continue;
+ }
}
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
ReleaseBuffer(vmbuffer);
- /* report that everything is now scanned */
- pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
+ return scan_done;
+}
- /* now we can compute the new value for pg_class.reltuples */
- vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
- vacrel->scan_state->scanned_pages,
- vacrel->scan_state->live_tuples);
+/*
+ * A parallel scan variant of heap_vac_scan_next_block(). Similar to
+ * heap_vac_scan_next_block(), the block number and visibility status of the next
+ * block to process are set in *blkno and *all_visible_according_to_vm. The return
+ * value is false if there are no further blocks to process.
+ *
+ * In parallel vacuum scan, we don't use the SKIP_PAGES_THRESHOLD optimization.
+ */
+static bool
+heap_vac_scan_next_block_parallel(LVRelState *vacrel, BlockNumber *blkno,
+ bool *all_visible_according_to_vm)
+{
+ PHVState *phvstate = vacrel->phvstate;
+ BlockNumber next_block;
+ Buffer vmbuffer = InvalidBuffer;
+ uint8 mapbits = 0;
- /*
- * Also compute the total number of surviving heap entries. In the
- * (unlikely) scenario that new_live_tuples is -1, take it as zero.
- */
- vacrel->new_rel_tuples =
- Max(vacrel->new_live_tuples, 0) + vacrel->scan_state->recently_dead_tuples +
- vacrel->scan_state->missed_dead_tuples;
+ Assert(ParallelHeapVacuumIsActive(vacrel));
- /*
- * Do index vacuuming (call each index's ambulkdelete routine), then do
- * related heap vacuuming
- */
- if (vacrel->dead_items_info->num_items > 0)
- lazy_vacuum(vacrel);
+ for (;;)
+ {
+ next_block = table_block_parallelscan_nextpage(vacrel->rel,
+ &(phvstate->myscanstate->state),
+ phvstate->pscandesc);
- /*
- * Vacuum the remainder of the Free Space Map. We must do this whether or
- * not there were indexes, and whether or not we bypassed index vacuuming.
- */
- if (blkno > next_fsm_block_to_vacuum)
- FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno);
+ /* Have we reached the end of the table? */
+ if (!BlockNumberIsValid(next_block) || next_block >= vacrel->rel_pages)
+ {
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
- /* report all blocks vacuumed */
- pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
+ *blkno = vacrel->rel_pages;
+ return false;
+ }
- /* Do final index cleanup (call each index's amvacuumcleanup routine) */
- if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
- lazy_cleanup_all_indexes(vacrel);
+ /* We always treat the last block as unsafe to skip */
+ if (next_block == vacrel->rel_pages - 1)
+ break;
+
+ mapbits = visibilitymap_get_status(vacrel->rel, next_block, &vmbuffer);
+
+ /*
+ * A block is unskippable if it is not all visible according to the
+ * visibility map.
+ */
+ if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
+ {
+ Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
+ break;
+ }
+
+ /* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
+ if (!vacrel->skipwithvm)
+ break;
+
+ /*
+ * Aggressive VACUUM caller can't skip pages just because they are
+ * all-visible.
+ */
+ if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
+ {
+ if (vacrel->aggressive)
+ break;
+
+ /*
+			 * An all-visible block is safe to skip in the non-aggressive
+			 * case. But remember for later that we skipped such a block.
+ */
+ vacrel->scan_state->skippedallvis = true;
+ }
+ }
+
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
+
+ *blkno = next_block;
+ *all_visible_according_to_vm = (mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0;
+
+ return true;
}
/*
@@ -1254,11 +1529,12 @@ find_next_unskippable_block(LVRelState *vacrel, bool *skipsallvis)
/*
* Caller must scan the last page to determine whether it has tuples
- * (caller must have the opportunity to set vacrel->nonempty_pages).
- * This rule avoids having lazy_truncate_heap() take access-exclusive
- * lock on rel to attempt a truncation that fails anyway, just because
- * there are tuples on the last page (it is likely that there will be
- * tuples on other nearby pages as well, but those can be skipped).
+ * (caller must have the opportunity to set
+ * vacrel->scan_state->nonempty_pages). This rule avoids having
+ * lazy_truncate_heap() take access-exclusive lock on rel to attempt a
+ * truncation that fails anyway, just because there are tuples on the
+ * last page (it is likely that there will be tuples on other nearby
+ * pages as well, but those can be skipped).
*
* Implement this by always treating the last block as unsafe to skip.
*/
@@ -2117,7 +2393,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
progress_start_val[1] = vacrel->nindexes;
pgstat_progress_update_multi_param(2, progress_start_index, progress_start_val);
- if (!ParallelVacuumIsActive(vacrel))
+ if (!ParallelIndexVacuumIsActive(vacrel))
{
for (int idx = 0; idx < vacrel->nindexes; idx++)
{
@@ -2493,7 +2769,7 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
progress_start_val[1] = vacrel->nindexes;
pgstat_progress_update_multi_param(2, progress_start_index, progress_start_val);
- if (!ParallelVacuumIsActive(vacrel))
+ if (!ParallelIndexVacuumIsActive(vacrel))
{
for (int idx = 0; idx < vacrel->nindexes; idx++)
{
@@ -2943,12 +3219,8 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
autovacuum_work_mem != -1 ?
autovacuum_work_mem : maintenance_work_mem;
- /*
- * Initialize state for a parallel vacuum. As of now, only one worker can
- * be used for an index, so we invoke parallelism only if there are at
- * least two indexes on a table.
- */
- if (nworkers >= 0 && vacrel->nindexes > 1 && vacrel->do_index_vacuuming)
+ /* Initialize state for a parallel vacuum */
+ if (nworkers >= 0)
{
/*
* Since parallel workers cannot access data in temporary tables, we
@@ -2966,11 +3238,20 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
vacrel->relname)));
}
else
+ {
+ /*
+		 * We initialize parallel heap scanning/vacuuming, index vacuuming,
+		 * or both, based on the table size and the number of indexes.
+ * Since only one worker can be used for an index, we will invoke
+ * parallelism for index vacuuming only if there are at least two
+ * indexes on a table.
+ */
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
vac_work_mem,
vacrel->verbose ? INFO : DEBUG2,
- vacrel->bstrategy);
+ vacrel->bstrategy, (void *) vacrel);
+ }
/*
* If parallel mode started, dead_items and dead_items_info spaces are
@@ -3010,9 +3291,19 @@ dead_items_add(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
};
int64 prog_val[2];
+ /*
+ * Protect both dead_items and dead_items_info from concurrent updates in
+ * parallel heap scan cases.
+ */
+ if (ParallelHeapVacuumIsActive(vacrel))
+ TidStoreLockExclusive(vacrel->dead_items);
+
TidStoreSetBlockOffsets(vacrel->dead_items, blkno, offsets, num_offsets);
vacrel->dead_items_info->num_items += num_offsets;
+ if (ParallelHeapVacuumIsActive(vacrel))
+ TidStoreUnlock(vacrel->dead_items);
+
/* update the progress information */
prog_val[0] = vacrel->dead_items_info->num_items;
prog_val[1] = TidStoreMemoryUsage(vacrel->dead_items);
@@ -3212,6 +3503,448 @@ update_relstats_all_indexes(LVRelState *vacrel)
}
}
+/*
+ * Compute the number of parallel workers for parallel vacuum heap scan.
+ *
+ * The calculation logic is borrowed from compute_parallel_worker().
+ */
+int
+heap_parallel_vacuum_compute_workers(Relation rel, int nrequested)
+{
+ int parallel_workers = 0;
+ int heap_parallel_threshold;
+ int heap_pages;
+
+ if (nrequested == 0)
+ {
+ /*
+ * Select the number of workers based on the log of the size of the
+ * relation. Note that the upper limit of the
+ * min_parallel_table_scan_size GUC is chosen to prevent overflow
+ * here.
+ */
+ heap_parallel_threshold = Max(min_parallel_table_scan_size, 1);
+ heap_pages = RelationGetNumberOfBlocks(rel);
+ while (heap_pages >= (BlockNumber) (heap_parallel_threshold * 3))
+ {
+ parallel_workers++;
+ heap_parallel_threshold *= 3;
+ if (heap_parallel_threshold > INT_MAX / 3)
+ break;
+ }
+ }
+ else
+ parallel_workers = nrequested;
+
+ return parallel_workers;
+}
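+
+/*
+ * For illustration, assuming the default min_parallel_table_scan_size of
+ * 8MB (1024 blocks), heap_parallel_vacuum_compute_workers() selects:
+ *
+ *   heap size >=  24MB (3 * 8MB)   -> 1 worker
+ *   heap size >=  72MB (9 * 8MB)   -> 2 workers
+ *   heap size >= 216MB (27 * 8MB)  -> 3 workers
+ *
+ * i.e. one more worker for every three-fold increase in table size, the same
+ * progression as compute_parallel_worker(). The result is later capped by
+ * max_parallel_maintenance_workers in parallel_vacuum_compute_workers().
+ */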
+
+/* Estimate shared memory sizes required for parallel heap vacuum */
+static inline void
+heap_parallel_estimate_shared_memory_size(Relation rel, int nworkers, Size *pscan_len,
+ Size *shared_len, Size *pscanwork_len)
+{
+ Size size = 0;
+
+ size = add_size(size, SizeOfPHVShared);
+ size = add_size(size, mul_size(sizeof(LVRelScanState), nworkers));
+ *shared_len = size;
+
+ *pscan_len = table_block_parallelscan_estimate(rel);
+
+ *pscanwork_len = mul_size(sizeof(PHVScanWorkerState), nworkers);
+}
+
+/*
+ * Compute the amount of space we'll need in the parallel heap vacuum
+ * DSM, and inform pcxt->estimator about our needs.
+ *
+ * nworkers is the number of workers for the table vacuum. Note that it could
+ * differ from pcxt->nworkers, since the latter is the maximum of the number
+ * of workers for table vacuum and for index vacuum.
+ */
+void
+heap_parallel_vacuum_estimate(Relation rel, ParallelContext *pcxt,
+ int nworkers, void *state)
+{
+ Size pscan_len;
+ Size shared_len;
+ Size pscanwork_len;
+
+ heap_parallel_estimate_shared_memory_size(rel, nworkers, &pscan_len,
+ &shared_len, &pscanwork_len);
+
+ /* space for PHVShared */
+ shm_toc_estimate_chunk(&pcxt->estimator, shared_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+ /* space for ParallelBlockTableScanDesc */
+ shm_toc_estimate_chunk(&pcxt->estimator, pscan_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+ /* space for per-worker scan state, PHVScanWorkerState */
+ shm_toc_estimate_chunk(&pcxt->estimator, pscanwork_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/*
+ * Set up shared memory for parallel heap vacuum.
+ */
+void
+heap_parallel_vacuum_initialize(Relation rel, ParallelContext *pcxt,
+ int nworkers, void *state)
+{
+ LVRelState *vacrel = (LVRelState *) state;
+	PHVState   *phvstate;
+ ParallelBlockTableScanDesc pscan;
+ PHVScanWorkerState *pscanwork;
+ PHVShared *shared;
+ Size pscan_len;
+ Size shared_len;
+ Size pscanwork_len;
+
+ phvstate = (PHVState *) palloc0(sizeof(PHVState));
+ phvstate->min_scanned_blkno = InvalidBlockNumber;
+
+ heap_parallel_estimate_shared_memory_size(rel, nworkers, &pscan_len,
+ &shared_len, &pscanwork_len);
+
+ shared = shm_toc_allocate(pcxt->toc, shared_len);
+
+ /* Prepare the shared information */
+
+ MemSet(shared, 0, shared_len);
+ shared->aggressive = vacrel->aggressive;
+ shared->skipwithvm = vacrel->skipwithvm;
+ shared->cutoffs = vacrel->cutoffs;
+ shared->NewRelfrozenXid = vacrel->scan_state->NewRelfrozenXid;
+ shared->NewRelminMxid = vacrel->scan_state->NewRelminMxid;
+ shared->skippedallvis = vacrel->scan_state->skippedallvis;
+
+ /*
+	 * XXX: we copy the contents of vistest to the shared area, but in order
+	 * to do that, we need to either expose the GlobalVisState struct or
+	 * provide functions to copy its contents somewhere. Currently we do the
+	 * former, but it's not clear that's the best choice.
+	 *
+	 * An alternative idea is to have each worker determine its own cutoff
+	 * and have its own vistest. But we need to consider that carefully,
+	 * since parallel workers would end up having different cutoffs and
+	 * horizons.
+ */
+ shared->vistest = *vacrel->vistest;
+
+ shm_toc_insert(pcxt->toc, LV_PARALLEL_KEY_SCAN_SHARED, shared);
+
+ phvstate->shared = shared;
+
+ /* prepare the parallel block table scan description */
+ pscan = shm_toc_allocate(pcxt->toc, pscan_len);
+ shm_toc_insert(pcxt->toc, LV_PARALLEL_KEY_SCAN_DESC, pscan);
+
+ /* initialize parallel scan description */
+ table_block_parallelscan_initialize(rel, (ParallelTableScanDesc) pscan);
+
+ /* Disable sync scan to always start from the first block */
+ pscan->base.phs_syncscan = false;
+
+ phvstate->pscandesc = pscan;
+
+ /* prepare the workers' parallel block table scan state */
+ pscanwork = shm_toc_allocate(pcxt->toc, pscanwork_len);
+ MemSet(pscanwork, 0, pscanwork_len);
+ shm_toc_insert(pcxt->toc, LV_PARALLEL_KEY_SCAN_DESC_WORKER, pscanwork);
+ phvstate->scanstates = pscanwork;
+
+ vacrel->phvstate = phvstate;
+}
+
+/*
+ * Main function for parallel heap vacuum workers.
+ */
+void
+heap_parallel_vacuum_worker(Relation rel, ParallelVacuumState *pvs,
+ ParallelWorkerContext *pwcxt)
+{
+ LVRelState vacrel = {0};
+ PHVState *phvstate;
+ PHVShared *shared;
+ ParallelBlockTableScanDesc pscandesc;
+ PHVScanWorkerState *scanstate;
+ LVRelScanState *scan_state;
+ ErrorContextCallback errcallback;
+ bool scan_done;
+
+ phvstate = palloc(sizeof(PHVState));
+
+ pscandesc = (ParallelBlockTableScanDesc) shm_toc_lookup(pwcxt->toc,
+ LV_PARALLEL_KEY_SCAN_DESC,
+ false);
+ phvstate->pscandesc = pscandesc;
+
+ shared = (PHVShared *) shm_toc_lookup(pwcxt->toc, LV_PARALLEL_KEY_SCAN_SHARED,
+ false);
+ phvstate->shared = shared;
+
+ scanstate = (PHVScanWorkerState *) shm_toc_lookup(pwcxt->toc,
+ LV_PARALLEL_KEY_SCAN_DESC_WORKER,
+ false);
+
+ phvstate->myscanstate = &(scanstate[ParallelWorkerNumber]);
+ scan_state = &(shared->worker_scan_state[ParallelWorkerNumber]);
+
+ /* Prepare LVRelState */
+ vacrel.rel = rel;
+ vacrel.indrels = parallel_vacuum_get_table_indexes(pvs, &vacrel.nindexes);
+ vacrel.pvs = pvs;
+ vacrel.phvstate = phvstate;
+ vacrel.aggressive = shared->aggressive;
+ vacrel.skipwithvm = shared->skipwithvm;
+ vacrel.cutoffs = shared->cutoffs;
+ vacrel.vistest = &(shared->vistest);
+ vacrel.dead_items = parallel_vacuum_get_dead_items(pvs,
+ &vacrel.dead_items_info);
+ vacrel.rel_pages = RelationGetNumberOfBlocks(rel);
+ vacrel.scan_state = scan_state;
+
+ /* initialize per-worker relation statistics */
+ MemSet(scan_state, 0, sizeof(LVRelScanState));
+
+ /* Set fields necessary for heap scan */
+ vacrel.scan_state->NewRelfrozenXid = shared->NewRelfrozenXid;
+ vacrel.scan_state->NewRelminMxid = shared->NewRelminMxid;
+ vacrel.scan_state->skippedallvis = shared->skippedallvis;
+
+ /* Initialize the per-worker scan state if not yet */
+ if (!phvstate->myscanstate->initialized)
+ {
+ table_block_parallelscan_startblock_init(rel,
+ &(phvstate->myscanstate->state),
+ phvstate->pscandesc);
+
+ phvstate->myscanstate->last_blkno = InvalidBlockNumber;
+ phvstate->myscanstate->maybe_have_blocks = false;
+ phvstate->myscanstate->initialized = true;
+ }
+
+ /*
+ * Setup error traceback support for ereport() for parallel table vacuum
+ * workers
+ */
+ vacrel.dbname = get_database_name(MyDatabaseId);
+	vacrel.relnamespace = get_namespace_name(RelationGetNamespace(rel));
+ vacrel.relname = pstrdup(RelationGetRelationName(rel));
+ vacrel.indname = NULL;
+ vacrel.phase = VACUUM_ERRCB_PHASE_SCAN_HEAP;
+ errcallback.callback = vacuum_error_callback;
+ errcallback.arg = &vacrel;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ scan_done = do_lazy_scan_heap(&vacrel);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+
+ /*
+	 * If the leader or a worker finishes the heap scan because the space for
+	 * dead_items TIDs is close to the limit, it might have some allocated
+	 * blocks left in its scan state. Since this scan state might not be used
+	 * in the next heap scan, we remember that it might have some unconsumed
+	 * blocks so that the leader can complete the scans after the heap scan
+	 * phase finishes.
+ */
+ phvstate->myscanstate->maybe_have_blocks = !scan_done;
+}
+
+/*
+ * Complete parallel heap scans that have remaining blocks in their
+ * chunks.
+ */
+static void
+parallel_heap_complete_unfinished_scan(LVRelState *vacrel)
+{
+ int nworkers;
+
+ Assert(!IsParallelWorker());
+
+ nworkers = parallel_vacuum_get_nworkers_table(vacrel->pvs);
+
+ for (int i = 0; i < nworkers; i++)
+ {
+ PHVScanWorkerState *wstate = &(vacrel->phvstate->scanstates[i]);
+ bool scan_done PG_USED_FOR_ASSERTS_ONLY;
+
+ if (!wstate->maybe_have_blocks)
+ continue;
+
+ /* Attach the worker's scan state and do heap scan */
+ vacrel->phvstate->myscanstate = wstate;
+ scan_done = do_lazy_scan_heap(vacrel);
+
+ Assert(scan_done);
+ }
+
+ /*
+ * We don't need to gather the scan results here because the leader's scan
+ * state got updated directly.
+ */
+}
+
+/*
+ * Compute the minimum block number we have scanned so far and update
+ * vacrel->phvstate->min_scanned_blkno.
+ */
+static void
+parallel_heap_vacuum_compute_min_scanned_blkno(LVRelState *vacrel)
+{
+ PHVState *phvstate = vacrel->phvstate;
+
+ Assert(ParallelHeapVacuumIsActive(vacrel));
+
+ /*
+ * We check all worker scan states here to compute the minimum block
+ * number among all scan states.
+ */
+ for (int i = 0; i < phvstate->nworkers_launched; i++)
+ {
+ PHVScanWorkerState *wstate = &(phvstate->scanstates[i]);
+
+		/* Skip if no worker has initialized the scan state */
+ if (!wstate->initialized)
+ continue;
+
+ if (!BlockNumberIsValid(phvstate->min_scanned_blkno) ||
+ wstate->last_blkno < phvstate->min_scanned_blkno)
+ phvstate->min_scanned_blkno = wstate->last_blkno;
+ }
+}
+
+/* Accumulate each worker's scan results into the leader's */
+static void
+parallel_heap_vacuum_gather_scan_results(LVRelState *vacrel)
+{
+ PHVState *phvstate = vacrel->phvstate;
+
+ Assert(ParallelHeapVacuumIsActive(vacrel));
+ Assert(!IsParallelWorker());
+
+ /* Gather the workers' scan results */
+ for (int i = 0; i < phvstate->nworkers_launched; i++)
+ {
+ LVRelScanState *ss = &(phvstate->shared->worker_scan_state[i]);
+
+ vacrel->scan_state->scanned_pages += ss->scanned_pages;
+ vacrel->scan_state->removed_pages += ss->removed_pages;
+ vacrel->scan_state->vm_new_frozen_pages += ss->vm_new_frozen_pages;
+ vacrel->scan_state->lpdead_item_pages += ss->lpdead_item_pages;
+ vacrel->scan_state->missed_dead_pages += ss->missed_dead_pages;
+ vacrel->scan_state->tuples_deleted += ss->tuples_deleted;
+ vacrel->scan_state->tuples_frozen += ss->tuples_frozen;
+ vacrel->scan_state->lpdead_items += ss->lpdead_items;
+ vacrel->scan_state->live_tuples += ss->live_tuples;
+ vacrel->scan_state->recently_dead_tuples += ss->recently_dead_tuples;
+ vacrel->scan_state->missed_dead_tuples += ss->missed_dead_tuples;
+
+		if (ss->nonempty_pages > vacrel->scan_state->nonempty_pages)
+ vacrel->scan_state->nonempty_pages = ss->nonempty_pages;
+
+ if (TransactionIdPrecedes(ss->NewRelfrozenXid, vacrel->scan_state->NewRelfrozenXid))
+ vacrel->scan_state->NewRelfrozenXid = ss->NewRelfrozenXid;
+
+ if (MultiXactIdPrecedesOrEquals(ss->NewRelminMxid, vacrel->scan_state->NewRelminMxid))
+ vacrel->scan_state->NewRelminMxid = ss->NewRelminMxid;
+
+ if (!vacrel->scan_state->skippedallvis && ss->skippedallvis)
+ vacrel->scan_state->skippedallvis = true;
+ }
+
+ /* Also, compute the minimum block number we scanned so far */
+ parallel_heap_vacuum_compute_min_scanned_blkno(vacrel);
+}
+
+/*
+ * A parallel variant of do_lazy_scan_heap(). The leader process launches parallel
+ * workers to scan the heap in parallel.
+ */
+static void
+do_parallel_lazy_scan_heap(LVRelState *vacrel)
+{
+ PHVScanWorkerState *scanstate;
+
+ Assert(ParallelHeapVacuumIsActive(vacrel));
+ Assert(!IsParallelWorker());
+
+	/* launch parallel workers */
+ vacrel->phvstate->nworkers_launched = parallel_vacuum_table_scan_begin(vacrel->pvs);
+
+	/* initialize the leader's scan state so it can join the scan as a worker */
+ scanstate = palloc0(sizeof(PHVScanWorkerState));
+ scanstate->last_blkno = InvalidBlockNumber;
+ table_block_parallelscan_startblock_init(vacrel->rel, &(scanstate->state),
+ vacrel->phvstate->pscandesc);
+ vacrel->phvstate->myscanstate = scanstate;
+
+ for (;;)
+ {
+ bool scan_done;
+
+ /*
+ * Scan the table until either we are close to overrunning the
+ * available space for dead_items TIDs or we reach the end of the
+ * table.
+ */
+ scan_done = do_lazy_scan_heap(vacrel);
+
+ /* wait for parallel workers to finish and gather scan results */
+ parallel_vacuum_table_scan_end(vacrel->pvs);
+ parallel_heap_vacuum_gather_scan_results(vacrel);
+
+		/* We reached the end of the table */
+ if (scan_done)
+ break;
+
+ /*
+		 * The parallel heap scan paused in the middle of the table because
+		 * the space for dead_items TIDs was full. We perform a round of
+		 * index and heap vacuuming, followed by FSM vacuuming.
+ */
+
+ /* Perform a round of index and heap vacuuming */
+ vacrel->consider_bypass_optimization = false;
+ lazy_vacuum(vacrel);
+
+ /*
+ * Vacuum the Free Space Map to make newly-freed space visible on
+ * upper-level FSM pages.
+ */
+ if (vacrel->phvstate->min_scanned_blkno > vacrel->next_fsm_block_to_vacuum)
+ {
+ /*
+ * min_scanned_blkno was updated when gathering the workers' scan
+ * results.
+ */
+ FreeSpaceMapVacuumRange(vacrel->rel, vacrel->next_fsm_block_to_vacuum,
+ vacrel->phvstate->min_scanned_blkno + 1);
+			vacrel->next_fsm_block_to_vacuum = vacrel->phvstate->min_scanned_blkno + 1;
+ }
+
+ /* Report that we are once again scanning the heap */
+ pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+ PROGRESS_VACUUM_PHASE_SCAN_HEAP);
+
+ /* Re-launch workers to restart parallel heap scan */
+ vacrel->phvstate->nworkers_launched =
+ parallel_vacuum_table_scan_begin(vacrel->pvs);
+ }
+
+ /*
+	 * The parallel heap scan finished, but it's possible that some workers
+	 * have allocated blocks but not processed them yet. This can happen, for
+	 * example, when workers exit because the space for dead_items TIDs was
+	 * full and the leader process launches fewer workers in the next cycle.
+ */
+ parallel_heap_complete_unfinished_scan(vacrel);
+}
+
/*
* Error context callback for errors occurring during vacuum. The error
* context messages for index phases should match the messages set in parallel
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 08011fde23f..9f8c8f09576 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -6,15 +6,24 @@
* This file contains routines that are intended to support setting up, using,
* and tearing down a ParallelVacuumState.
*
- * In a parallel vacuum, we perform both index bulk deletion and index cleanup
- * with parallel worker processes. Individual indexes are processed by one
- * vacuum process. ParallelVacuumState contains shared information as well as
- * the memory space for storing dead items allocated in the DSA area. We
- * launch parallel worker processes at the start of parallel index
- * bulk-deletion and index cleanup and once all indexes are processed, the
- * parallel worker processes exit. Each time we process indexes in parallel,
- * the parallel context is re-initialized so that the same DSM can be used for
- * multiple passes of index bulk-deletion and index cleanup.
+ * In a parallel vacuum, we perform the table scan, index bulk deletion and
+ * index cleanup, or all of them with parallel worker processes. Different
+ * numbers of workers are launched for table vacuuming and index processing.
+ * ParallelVacuumState contains shared information as well as the memory space
+ * for storing dead items allocated in the DSA area.
+ *
+ * When initializing parallel table vacuum scan, we invoke table AM routines for
+ * estimating DSM sizes and initializing DSM memory. Parallel table vacuum
+ * workers invoke the table AM routine for vacuuming the table.
+ *
+ * For processing indexes in parallel, individual indexes are processed by one
+ * vacuum process. We launch parallel worker processes at the start of parallel index
+ * bulk-deletion and index cleanup and once all indexes are processed, the parallel
+ * worker processes exit.
+ *
+ * Each time we process the table or indexes in parallel, the parallel context
+ * is re-initialized so that the same DSM can be used for multiple passes of
+ * table vacuum or index bulk-deletion and index cleanup.
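+ *
+ * For example, if both the table and its indexes qualify for parallel
+ * processing, the leader first launches workers for the table scan; when the
+ * scan pauses because the space for dead item TIDs fills up, the leader waits
+ * for those workers to exit, re-initializes the same DSM, launches workers
+ * for index bulk-deletion, and then resumes the parallel table scan. This
+ * cycle repeats until the whole table has been processed.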
*
* Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -28,6 +37,7 @@
#include "access/amapi.h"
#include "access/table.h"
+#include "access/tableam.h"
#include "access/xact.h"
#include "commands/progress.h"
#include "commands/vacuum.h"
@@ -65,6 +75,12 @@ typedef struct PVShared
int elevel;
uint64 queryid;
+ /*
+	 * True if the caller wants parallel workers to invoke the vacuum table
+	 * scan callback.
+ */
+ bool do_vacuum_table_scan;
+
/*
* Fields for both index vacuum and cleanup.
*
@@ -101,6 +117,13 @@ typedef struct PVShared
*/
pg_atomic_uint32 cost_balance;
+ /*
+ * The number of workers for parallel table scan/vacuuming and index
+ * vacuuming, respectively.
+ */
+ int nworkers_for_table;
+ int nworkers_for_index;
+
/*
* Number of active parallel workers. This is used for computing the
* minimum threshold of the vacuum cost balance before a worker sleeps for
@@ -164,6 +187,9 @@ struct ParallelVacuumState
/* NULL for worker processes */
ParallelContext *pcxt;
+ /* Passed to parallel table scan workers. NULL for leader process */
+ ParallelWorkerContext *pwcxt;
+
/* Parent Heap Relation */
Relation heaprel;
@@ -193,6 +219,9 @@ struct ParallelVacuumState
/* Points to WAL usage area in DSM */
WalUsage *wal_usage;
+	/* How many times has the parallel table vacuum scan been performed? */
+ int num_table_scans;
+
/*
* False if the index is totally unsuitable target for all parallel
* processing. For example, the index could be <
@@ -224,8 +253,9 @@ struct ParallelVacuumState
PVIndVacStatus status;
};
-static int parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested,
- bool *will_parallel_vacuum);
+static void parallel_vacuum_compute_workers(Relation rel, Relation *indrels, int nindexes,
+ int nrequested, int *nworkers_for_table,
+ int *nworkers_for_index, bool *will_parallel_vacuum);
static void parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, bool vacuum);
static void parallel_vacuum_process_safe_indexes(ParallelVacuumState *pvs);
static void parallel_vacuum_process_unsafe_indexes(ParallelVacuumState *pvs);
@@ -244,7 +274,7 @@ static void parallel_vacuum_error_callback(void *arg);
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
int nrequested_workers, int vac_work_mem,
- int elevel, BufferAccessStrategy bstrategy)
+ int elevel, BufferAccessStrategy bstrategy, void *state)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
@@ -258,6 +288,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
Size est_shared_len;
int nindexes_mwm = 0;
int parallel_workers = 0;
+ int nworkers_for_table;
+ int nworkers_for_index;
int querylen;
/*
@@ -265,15 +297,17 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
* relation
*/
Assert(nrequested_workers >= 0);
- Assert(nindexes > 0);
/*
* Compute the number of parallel vacuum workers to launch
*/
will_parallel_vacuum = (bool *) palloc0(sizeof(bool) * nindexes);
- parallel_workers = parallel_vacuum_compute_workers(indrels, nindexes,
- nrequested_workers,
- will_parallel_vacuum);
+ parallel_vacuum_compute_workers(rel, indrels, nindexes, nrequested_workers,
+ &nworkers_for_table, &nworkers_for_index,
+ will_parallel_vacuum);
+
+ parallel_workers = Max(nworkers_for_table, nworkers_for_index);
+
if (parallel_workers <= 0)
{
/* Can't perform vacuum in parallel -- return NULL */
@@ -329,6 +363,10 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
else
querylen = 0; /* keep compiler quiet */
+ /* Estimate AM-specific space for parallel table vacuum */
+ if (nworkers_for_table > 0)
+ table_parallel_vacuum_estimate(rel, pcxt, nworkers_for_table, state);
+
InitializeParallelDSM(pcxt);
/* Prepare index vacuum stats */
@@ -373,6 +411,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shared->relid = RelationGetRelid(rel);
shared->elevel = elevel;
shared->queryid = pgstat_get_my_query_id();
+ shared->nworkers_for_table = nworkers_for_table;
+ shared->nworkers_for_index = nworkers_for_index;
shared->maintenance_work_mem_worker =
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
@@ -421,6 +461,10 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
PARALLEL_VACUUM_KEY_QUERY_TEXT, sharedquery);
}
+ /* Prepare AM-specific DSM for parallel table vacuum */
+ if (nworkers_for_table > 0)
+ table_parallel_vacuum_initialize(rel, pcxt, nworkers_for_table, state);
+
/* Success -- return parallel vacuum state */
return pvs;
}
@@ -534,33 +578,48 @@ parallel_vacuum_cleanup_all_indexes(ParallelVacuumState *pvs, long num_table_tup
}
/*
- * Compute the number of parallel worker processes to request. Both index
- * vacuum and index cleanup can be executed with parallel workers.
- * The index is eligible for parallel vacuum iff its size is greater than
- * min_parallel_index_scan_size as invoking workers for very small indexes
- * can hurt performance.
+ * Compute the number of parallel worker processes to request for table
+ * vacuum and index vacuum/cleanup.
+ *
+ * For parallel table vacuum, we ask the AM-specific routine to compute the
+ * number of parallel worker processes. The result is set to
+ * *nworkers_for_table.
*
- * nrequested is the number of parallel workers that user requested. If
- * nrequested is 0, we compute the parallel degree based on nindexes, that is
- * the number of indexes that support parallel vacuum. This function also
- * sets will_parallel_vacuum to remember indexes that participate in parallel
- * vacuum.
+ * For parallel index vacuum, an index is eligible for parallel vacuum iff
+ * its size is greater than min_parallel_index_scan_size, as invoking workers
+ * for very small indexes can hurt performance. nrequested is the number of
+ * parallel workers that the user requested. If nrequested is 0, we compute
+ * the parallel degree based on nindexes, that is, the number of indexes that
+ * support parallel vacuum. This function also sets will_parallel_vacuum to
+ * remember the indexes that participate in parallel vacuum.
*/
-static int
-parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested,
- bool *will_parallel_vacuum)
+static void
+parallel_vacuum_compute_workers(Relation rel, Relation *indrels, int nindexes,
+ int nrequested, int *nworkers_for_table,
+ int *nworkers_for_index, bool *will_parallel_vacuum)
{
int nindexes_parallel = 0;
int nindexes_parallel_bulkdel = 0;
int nindexes_parallel_cleanup = 0;
- int parallel_workers;
+ int parallel_workers_table = 0;
+ int parallel_workers_index = 0;
/*
* We don't allow performing parallel operation in standalone backend or
* when parallelism is disabled.
*/
if (!IsUnderPostmaster || max_parallel_maintenance_workers == 0)
- return 0;
+ {
+ *nworkers_for_table = 0;
+ *nworkers_for_index = 0;
+ return;
+ }
+
+ /*
+ * Compute the number of workers for parallel table scan. Cap by
+ * max_parallel_maintenance_workers.
+ */
+ parallel_workers_table = Min(table_parallel_vacuum_compute_workers(rel, nrequested),
+ max_parallel_maintenance_workers);
/*
* Compute the number of indexes that can participate in parallel vacuum.
@@ -591,17 +650,18 @@ parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested,
nindexes_parallel--;
/* No index supports parallel vacuum */
- if (nindexes_parallel <= 0)
- return 0;
-
- /* Compute the parallel degree */
- parallel_workers = (nrequested > 0) ?
- Min(nrequested, nindexes_parallel) : nindexes_parallel;
+ if (nindexes_parallel > 0)
+ {
+ /* Compute the parallel degree for parallel index vacuum */
+ parallel_workers_index = (nrequested > 0) ?
+ Min(nrequested, nindexes_parallel) : nindexes_parallel;
- /* Cap by max_parallel_maintenance_workers */
- parallel_workers = Min(parallel_workers, max_parallel_maintenance_workers);
+ /* Cap by max_parallel_maintenance_workers */
+ parallel_workers_index = Min(parallel_workers_index, max_parallel_maintenance_workers);
+ }
- return parallel_workers;
+ *nworkers_for_table = parallel_workers_table;
+ *nworkers_for_index = parallel_workers_index;
}
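+
+/*
+ * For example, with nrequested = 0, a heap large enough for two scan workers,
+ * and three parallel-safe indexes, this function sets *nworkers_for_table = 2
+ * and *nworkers_for_index = 3; parallel_vacuum_init() then sizes the parallel
+ * context for Max(2, 3) = 3 workers and launches the appropriate number for
+ * each phase.
+ */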
/*
@@ -669,8 +729,12 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, bool vacuum)
/* Setup the shared cost-based vacuum delay and launch workers */
if (nworkers > 0)
{
- /* Reinitialize parallel context to relaunch parallel workers */
- if (pvs->num_index_scans > 0)
+ /*
+ * Reinitialize parallel context to relaunch parallel workers if we
+ * have used the parallel context for either index vacuuming or table
+ * vacuuming.
+ */
+ if (pvs->num_index_scans > 0 || pvs->num_table_scans > 0)
ReinitializeParallelDSM(pvs->pcxt);
/*
@@ -982,6 +1046,146 @@ parallel_vacuum_index_is_parallel_safe(Relation indrel, int num_index_scans,
return true;
}
+/*
+ * Prepare the DSM and the shared cost-based vacuum delay, and launch parallel
+ * workers for parallel table vacuum. Return the number of parallel workers
+ * launched.
+ *
+ * The caller must call parallel_vacuum_table_scan_end() to finish the parallel
+ * table vacuum.
+ */
+int
+parallel_vacuum_table_scan_begin(ParallelVacuumState *pvs)
+{
+ Assert(!IsParallelWorker());
+
+ if (pvs->shared->nworkers_for_table == 0)
+ return 0;
+
+ pg_atomic_write_u32(&(pvs->shared->cost_balance), VacuumCostBalance);
+ pg_atomic_write_u32(&(pvs->shared->active_nworkers), 0);
+
+ pvs->shared->do_vacuum_table_scan = true;
+
+ if (pvs->num_table_scans > 0)
+ ReinitializeParallelDSM(pvs->pcxt);
+
+ /*
+ * The number of workers might vary between table vacuum and index
+ * processing
+ */
+ ReinitializeParallelWorkers(pvs->pcxt, pvs->shared->nworkers_for_table);
+ LaunchParallelWorkers(pvs->pcxt);
+
+ if (pvs->pcxt->nworkers_launched > 0)
+ {
+ /*
+ * Reset the local cost values for leader backend as we have already
+ * accumulated the remaining balance of heap.
+ */
+ VacuumCostBalance = 0;
+ VacuumCostBalanceLocal = 0;
+
+ /* Enable shared cost balance for leader backend */
+ VacuumSharedCostBalance = &(pvs->shared->cost_balance);
+ VacuumActiveNWorkers = &(pvs->shared->active_nworkers);
+
+ /* Include the worker count for the leader itself */
+ pg_atomic_add_fetch_u32(VacuumActiveNWorkers, 1);
+ }
+
+ ereport(pvs->shared->elevel,
+ (errmsg(ngettext("launched %d parallel vacuum worker for table processing (planned: %d)",
+ "launched %d parallel vacuum workers for table processing (planned: %d)",
+ pvs->pcxt->nworkers_launched),
+ pvs->pcxt->nworkers_launched, pvs->shared->nworkers_for_table)));
+
+ return pvs->pcxt->nworkers_launched;
+}
+
+/*
+ * Wait for all parallel table vacuum scan workers to finish, and gather
+ * statistics.
+ */
+void
+parallel_vacuum_table_scan_end(ParallelVacuumState *pvs)
+{
+ Assert(!IsParallelWorker());
+
+ if (pvs->shared->nworkers_for_table == 0)
+ return;
+
+ WaitForParallelWorkersToFinish(pvs->pcxt);
+
+ /* Decrement the worker count for the leader itself */
+ pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1);
+
+ for (int i = 0; i < pvs->pcxt->nworkers_launched; i++)
+ InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]);
+
+ /*
+ * Carry the shared balance value to heap scan and disable shared costing
+ */
+ if (VacuumSharedCostBalance)
+ {
+ VacuumCostBalance = pg_atomic_read_u32(VacuumSharedCostBalance);
+ VacuumSharedCostBalance = NULL;
+ VacuumActiveNWorkers = NULL;
+ }
+
+ pvs->shared->do_vacuum_table_scan = false;
+ pvs->num_table_scans++;
+}
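+
+/*
+ * Typical usage by the leader, as in do_parallel_lazy_scan_heap():
+ *
+ *	nlaunched = parallel_vacuum_table_scan_begin(pvs);
+ *	... scan the heap as a participant ...
+ *	parallel_vacuum_table_scan_end(pvs);
+ *
+ * parallel_vacuum_table_scan_begin() and parallel_vacuum_table_scan_end()
+ * must be paired around each pass of the parallel table scan.
+ */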
+
+/*
+ * Return the array of indexes associated with the given table to be vacuumed.
+ */
+Relation *
+parallel_vacuum_get_table_indexes(ParallelVacuumState *pvs, int *nindexes)
+{
+ *nindexes = pvs->nindexes;
+
+ return pvs->indrels;
+}
+
+/*
+ * Return the number of workers for parallel table vacuum.
+ */
+int
+parallel_vacuum_get_nworkers_table(ParallelVacuumState *pvs)
+{
+ return pvs->shared->nworkers_for_table;
+}
+
+/*
+ * Return the number of workers for parallel index processing.
+ */
+int
+parallel_vacuum_get_nworkers_index(ParallelVacuumState *pvs)
+{
+ return pvs->shared->nworkers_for_index;
+}
+
+/*
+ * A parallel worker invokes the table-AM-specific vacuum scan callback.
+ */
+static void
+parallel_vacuum_process_table(ParallelVacuumState *pvs)
+{
+ Assert(VacuumActiveNWorkers);
+ Assert(pvs->shared->do_vacuum_table_scan);
+
+	/* Increment the active worker count before starting the table vacuum */
+ pg_atomic_add_fetch_u32(VacuumActiveNWorkers, 1);
+
+ /* Do table vacuum scan */
+ table_parallel_vacuum_relation_worker(pvs->heaprel, pvs, pvs->pwcxt);
+
+ /*
+ * We have completed the table vacuum so decrement the active worker
+ * count.
+ */
+ pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1);
+}
+
/*
* Perform work within a launched parallel process.
*
@@ -1033,7 +1237,6 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
* matched to the leader's one.
*/
vac_open_indexes(rel, RowExclusiveLock, &nindexes, &indrels);
- Assert(nindexes > 0);
if (shared->maintenance_work_mem_worker > 0)
maintenance_work_mem = shared->maintenance_work_mem_worker;
@@ -1064,6 +1267,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
pvs.relname = pstrdup(RelationGetRelationName(rel));
pvs.heaprel = rel;
+ pvs.pwcxt = palloc(sizeof(ParallelWorkerContext));
+ pvs.pwcxt->toc = toc;
+ pvs.pwcxt->seg = seg;
+
/* These fields will be filled during index vacuum or cleanup */
pvs.indname = NULL;
pvs.status = PARALLEL_INDVAC_STATUS_INITIAL;
@@ -1081,8 +1288,16 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Prepare to track buffer usage during parallel execution */
InstrStartParallelQuery();
- /* Process indexes to perform vacuum/cleanup */
- parallel_vacuum_process_safe_indexes(&pvs);
+ if (pvs.shared->do_vacuum_table_scan)
+ {
+ /* Process table to perform vacuum */
+ parallel_vacuum_process_table(&pvs);
+ }
+ else
+ {
+ /* Process indexes to perform vacuum/cleanup */
+ parallel_vacuum_process_safe_indexes(&pvs);
+ }
/* Report buffer/WAL usage during parallel execution */
buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 2e54c11f880..4813a07860d 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -99,80 +99,6 @@ typedef struct ProcArrayStruct
int pgprocnos[FLEXIBLE_ARRAY_MEMBER];
} ProcArrayStruct;
-/*
- * State for the GlobalVisTest* family of functions. Those functions can
- * e.g. be used to decide if a deleted row can be removed without violating
- * MVCC semantics: If the deleted row's xmax is not considered to be running
- * by anyone, the row can be removed.
- *
- * To avoid slowing down GetSnapshotData(), we don't calculate a precise
- * cutoff XID while building a snapshot (looking at the frequently changing
- * xmins scales badly). Instead we compute two boundaries while building the
- * snapshot:
- *
- * 1) definitely_needed, indicating that rows deleted by XIDs >=
- * definitely_needed are definitely still visible.
- *
- * 2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can
- * definitely be removed
- *
- * When testing an XID that falls in between the two (i.e. XID >= maybe_needed
- * && XID < definitely_needed), the boundaries can be recomputed (using
- * ComputeXidHorizons()) to get a more accurate answer. This is cheaper than
- * maintaining an accurate value all the time.
- *
- * As it is not cheap to compute accurate boundaries, we limit the number of
- * times that happens in short succession. See GlobalVisTestShouldUpdate().
- *
- *
- * There are three backend lifetime instances of this struct, optimized for
- * different types of relations. As e.g. a normal user defined table in one
- * database is inaccessible to backends connected to another database, a test
- * specific to a relation can be more aggressive than a test for a shared
- * relation. Currently we track four different states:
- *
- * 1) GlobalVisSharedRels, which only considers an XID's
- * effects visible-to-everyone if neither snapshots in any database, nor a
- * replication slot's xmin, nor a replication slot's catalog_xmin might
- * still consider XID as running.
- *
- * 2) GlobalVisCatalogRels, which only considers an XID's
- * effects visible-to-everyone if neither snapshots in the current
- * database, nor a replication slot's xmin, nor a replication slot's
- * catalog_xmin might still consider XID as running.
- *
- * I.e. the difference to GlobalVisSharedRels is that
- * snapshot in other databases are ignored.
- *
- * 3) GlobalVisDataRels, which only considers an XID's
- * effects visible-to-everyone if neither snapshots in the current
- * database, nor a replication slot's xmin consider XID as running.
- *
- * I.e. the difference to GlobalVisCatalogRels is that
- * replication slot's catalog_xmin is not taken into account.
- *
- * 4) GlobalVisTempRels, which only considers the current session, as temp
- * tables are not visible to other sessions.
- *
- * GlobalVisTestFor(relation) returns the appropriate state
- * for the relation.
- *
- * The boundaries are FullTransactionIds instead of TransactionIds to avoid
- * wraparound dangers. There e.g. would otherwise exist no procarray state to
- * prevent maybe_needed to become old enough after the GetSnapshotData()
- * call.
- *
- * The typedef is in the header.
- */
-struct GlobalVisState
-{
- /* XIDs >= are considered running by some backend */
- FullTransactionId definitely_needed;
-
- /* XIDs < are not considered to be running by any backend */
- FullTransactionId maybe_needed;
-};
-
/*
* Result of ComputeXidHorizons().
*/
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 7d06dad83fc..94438eff25c 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -21,6 +21,7 @@
#include "access/skey.h"
#include "access/table.h" /* for backward compatibility */
#include "access/tableam.h"
+#include "commands/vacuum.h"
#include "nodes/lockoptions.h"
#include "nodes/primnodes.h"
#include "storage/bufpage.h"
@@ -401,6 +402,13 @@ extern void log_heap_prune_and_freeze(Relation relation, Buffer buffer,
struct VacuumParams;
extern void heap_vacuum_rel(Relation rel,
struct VacuumParams *params, BufferAccessStrategy bstrategy);
+extern int heap_parallel_vacuum_compute_workers(Relation rel, int requested);
+extern void heap_parallel_vacuum_estimate(Relation rel, ParallelContext *pcxt,
+ int nworkers, void *state);
+extern void heap_parallel_vacuum_initialize(Relation rel, ParallelContext *pcxt,
+ int nworkers, void *state);
+extern void heap_parallel_vacuum_worker(Relation rel, ParallelVacuumState *pvs,
+ ParallelWorkerContext *pwcxt);
/* in heap/heapam_visibility.c */
extern bool HeapTupleSatisfiesVisibility(HeapTuple htup, Snapshot snapshot,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 09b9b394e0e..d7d74514a60 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -20,6 +20,7 @@
#include "access/relscan.h"
#include "access/sdir.h"
#include "access/xact.h"
+#include "commands/vacuum.h"
#include "executor/tuptable.h"
#include "storage/read_stream.h"
#include "utils/rel.h"
@@ -654,6 +655,47 @@ typedef struct TableAmRoutine
struct VacuumParams *params,
BufferAccessStrategy bstrategy);
+ /* ------------------------------------------------------------------------
+ * Callbacks for parallel table vacuum.
+ * ------------------------------------------------------------------------
+ */
+
+ /*
+ * Compute the number of parallel workers for parallel table vacuum. The
+ * function must return 0 to disable parallel table vacuum.
+ */
+ int (*parallel_vacuum_compute_workers) (Relation rel, int requested);
+
+ /*
+ * Estimate the amount of shared memory that the table AM needs for a
+ * parallel table vacuum.
+ *
+ * Not called if parallel table vacuum is disabled.
+ */
+ void (*parallel_vacuum_estimate) (Relation rel,
+ ParallelContext *pcxt,
+ int nworkers,
+ void *state);
+
+ /*
+ * Initialize DSM space for parallel table vacuum.
+ *
+ * Not called if parallel table vacuum is disabled.
+ */
+ void (*parallel_vacuum_initialize) (Relation rel,
+ ParallelContext *pctx,
+ int nworkers,
+ void *state);
+
+ /*
+ * This callback is called in each parallel table vacuum worker process.
+ *
+ * Not called if parallel table vacuum is disabled.
+ */
+ void (*parallel_vacuum_relation_worker) (Relation rel,
+ ParallelVacuumState *pvs,
+ ParallelWorkerContext *pwcxt);
+
/*
* Prepare to analyze block `blockno` of `scan`. The scan has been started
* with table_beginscan_analyze(). See also
@@ -1715,6 +1757,52 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
rel->rd_tableam->relation_vacuum(rel, params, bstrategy);
}
+/* ----------------------------------------------------------------------------
+ * Parallel vacuum related functions.
+ * ----------------------------------------------------------------------------
+ */
+
+/*
+ * Return the number of parallel workers for a parallel vacuum scan of this
+ * relation.
+ */
+static inline int
+table_parallel_vacuum_compute_workers(Relation rel, int requested)
+{
+ return rel->rd_tableam->parallel_vacuum_compute_workers(rel, requested);
+}
+
+/*
+ * Estimate the size of shared memory needed for a parallel vacuum scan of
+ * this relation.
+ */
+static inline void
+table_parallel_vacuum_estimate(Relation rel, ParallelContext *pcxt, int nworkers,
+ void *state)
+{
+ rel->rd_tableam->parallel_vacuum_estimate(rel, pcxt, nworkers, state);
+}
+
+/*
+ * Initialize shared memory area for a parallel vacuum scan of this relation.
+ */
+static inline void
+table_parallel_vacuum_initialize(Relation rel, ParallelContext *pcxt, int nworkers,
+ void *state)
+{
+ rel->rd_tableam->parallel_vacuum_initialize(rel, pcxt, nworkers, state);
+}
+
+/*
+ * Perform parallel vacuuming of this relation in a parallel worker.
+ */
+static inline void
+table_parallel_vacuum_relation_worker(Relation rel, ParallelVacuumState *pvs,
+ ParallelWorkerContext *pwcxt)
+{
+ rel->rd_tableam->parallel_vacuum_relation_worker(rel, pvs, pwcxt);
+}
+
/*
* Prepare to analyze the next block in the read stream. The scan needs to
* have been started with table_beginscan_analyze(). Note that this routine
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index e7b7753b691..d45866d61e5 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -360,7 +360,8 @@ extern void VacuumUpdateCosts(void);
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
int vac_work_mem, int elevel,
- BufferAccessStrategy bstrategy);
+ BufferAccessStrategy bstrategy,
+ void *state);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs,
VacDeadItemsInfo **dead_items_info_p);
@@ -370,6 +371,11 @@ extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
extern void parallel_vacuum_cleanup_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
bool estimated_count);
+extern int parallel_vacuum_table_scan_begin(ParallelVacuumState *pvs);
+extern void parallel_vacuum_table_scan_end(ParallelVacuumState *pvs);
+extern int parallel_vacuum_get_nworkers_table(ParallelVacuumState *pvs);
+extern int parallel_vacuum_get_nworkers_index(ParallelVacuumState *pvs);
+extern Relation *parallel_vacuum_get_table_indexes(ParallelVacuumState *pvs, int *nindexes);
extern void parallel_vacuum_main(dsm_segment *seg, shm_toc *toc);
/* in commands/analyze.c */
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index d346be71642..3b6fb603544 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -17,6 +17,7 @@
#include "utils/relcache.h"
#include "utils/resowner.h"
#include "utils/snapshot.h"
+#include "utils/snapmgr_internal.h"
extern PGDLLIMPORT bool FirstSnapshotSet;
@@ -95,7 +96,6 @@ extern char *ExportSnapshot(Snapshot snapshot);
* These live in procarray.c because they're intimately linked to the
* procarray contents, but thematically they better fit into snapmgr.h.
*/
-typedef struct GlobalVisState GlobalVisState;
extern GlobalVisState *GlobalVisTestFor(Relation rel);
extern bool GlobalVisTestIsRemovableXid(GlobalVisState *state, TransactionId xid);
extern bool GlobalVisTestIsRemovableFullXid(GlobalVisState *state, FullTransactionId fxid);
diff --git a/src/include/utils/snapmgr_internal.h b/src/include/utils/snapmgr_internal.h
new file mode 100644
index 00000000000..4363adf7f62
--- /dev/null
+++ b/src/include/utils/snapmgr_internal.h
@@ -0,0 +1,91 @@
+/*-------------------------------------------------------------------------
+ *
+ * snapmgr_internal.h
+ * This file contains declarations of structs for snapshot manager
+ * for internal use.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/snapmgr_internal.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SNAPMGR_INTERNAL_H
+#define SNAPMGR_INTERNAL_H
+
+#include "access/transam.h"
+
+/*
+ * State for the GlobalVisTest* family of functions. Those functions can
+ * e.g. be used to decide if a deleted row can be removed without violating
+ * MVCC semantics: If the deleted row's xmax is not considered to be running
+ * by anyone, the row can be removed.
+ *
+ * To avoid slowing down GetSnapshotData(), we don't calculate a precise
+ * cutoff XID while building a snapshot (looking at the frequently changing
+ * xmins scales badly). Instead we compute two boundaries while building the
+ * snapshot:
+ *
+ * 1) definitely_needed, indicating that rows deleted by XIDs >=
+ * definitely_needed are definitely still visible.
+ *
+ * 2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can
+ * definitely be removed
+ *
+ * When testing an XID that falls in between the two (i.e. XID >= maybe_needed
+ * && XID < definitely_needed), the boundaries can be recomputed (using
+ * ComputeXidHorizons()) to get a more accurate answer. This is cheaper than
+ * maintaining an accurate value all the time.
+ *
+ * As it is not cheap to compute accurate boundaries, we limit the number of
+ * times that happens in short succession. See GlobalVisTestShouldUpdate().
+ *
+ *
+ * There are three backend lifetime instances of this struct, optimized for
+ * different types of relations. As e.g. a normal user defined table in one
+ * database is inaccessible to backends connected to another database, a test
+ * specific to a relation can be more aggressive than a test for a shared
+ * relation. Currently we track four different states:
+ *
+ * 1) GlobalVisSharedRels, which only considers an XID's
+ * effects visible-to-everyone if neither snapshots in any database, nor a
+ * replication slot's xmin, nor a replication slot's catalog_xmin might
+ * still consider XID as running.
+ *
+ * 2) GlobalVisCatalogRels, which only considers an XID's
+ * effects visible-to-everyone if neither snapshots in the current
+ * database, nor a replication slot's xmin, nor a replication slot's
+ * catalog_xmin might still consider XID as running.
+ *
+ * I.e. the difference to GlobalVisSharedRels is that
+ * snapshot in other databases are ignored.
+ *
+ * 3) GlobalVisDataRels, which only considers an XID's
+ * effects visible-to-everyone if neither snapshots in the current
+ * database, nor a replication slot's xmin consider XID as running.
+ *
+ * I.e. the difference to GlobalVisCatalogRels is that
+ * replication slot's catalog_xmin is not taken into account.
+ *
+ * 4) GlobalVisTempRels, which only considers the current session, as temp
+ * tables are not visible to other sessions.
+ *
+ * GlobalVisTestFor(relation) returns the appropriate state
+ * for the relation.
+ *
+ * The boundaries are FullTransactionIds instead of TransactionIds to avoid
+ * wraparound dangers. There e.g. would otherwise exist no procarray state to
+ * prevent maybe_needed to become old enough after the GetSnapshotData()
+ * call.
+ */
+typedef struct GlobalVisState
+{
+ /* XIDs >= are considered running by some backend */
+ FullTransactionId definitely_needed;
+
+ /* XIDs < are not considered to be running by any backend */
+ FullTransactionId maybe_needed;
+} GlobalVisState;
+
+#endif /* SNAPMGR_INTERNAL_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 80202d4a824..ede0da49ce0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1842,6 +1842,9 @@ PGresAttValue
PGresParamDesc
PGresult
PGresult_data
+PHVScanWorkerState
+PHVShared
+PHVState
PIO_STATUS_BLOCK
PLAINTREE
PLAssignStmt
--
2.43.5
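As a rough illustration of how a table AM is expected to use the new API,
here is a minimal sketch (not part of the patch) that wires the functions
declared in the heapam.h hunk into a TableAmRoutine; only the callback and
function names taken from the hunks above are real, everything else is
elided:

#include "access/heapam.h"
#include "access/tableam.h"

/*
 * Illustrative sketch only, not patch code: a table AM opts into the
 * parallel table vacuum API by filling in the four new callbacks. Per
 * the tableam.h comment above, parallel_vacuum_compute_workers must
 * return 0 to disable parallel table vacuum.
 */
static const TableAmRoutine heapam_methods = {
	/* ... all existing callbacks elided ... */
	.parallel_vacuum_compute_workers = heap_parallel_vacuum_compute_workers,
	.parallel_vacuum_estimate = heap_parallel_vacuum_estimate,
	.parallel_vacuum_initialize = heap_parallel_vacuum_initialize,
	.parallel_vacuum_relation_worker = heap_parallel_vacuum_worker,
};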
v6-0002-Remember-the-number-of-times-parallel-index-vacuu.patch
From 26028c5b9838dc3c1b688d23bc1285c455f4409f Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 13 Dec 2024 15:54:32 -0800
Subject: [PATCH v6 2/8] Remember the number of times parallel index
vacuuming/cleanup is executed in ParallelVacuumState.
Previously, the caller could pass an arbitrary value for
'num_index_scans' to the parallel index vacuuming and cleanup APIs,
which didn't make sense: the caller had to carefully count how many
times it executed index vacuuming or cleanup, or else reinitializing
the parallel DSM would go wrong.
This commit changes the parallel vacuum APIs so that
ParallelVacuumState itself keeps the num_index_scans counter and
reinitializes the parallel DSM based on it.
An upcoming patch for parallel table scan will do a similar thing.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
src/backend/access/heap/vacuumlazy.c | 4 +---
src/backend/commands/vacuumparallel.c | 27 +++++++++++++++------------
src/include/commands/vacuum.h | 4 +---
3 files changed, 17 insertions(+), 18 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 75cd67395f4..116c0612ca5 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2143,8 +2143,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
else
{
/* Outsource everything to parallel variant */
- parallel_vacuum_bulkdel_all_indexes(vacrel->pvs, old_live_tuples,
- vacrel->num_index_scans);
+ parallel_vacuum_bulkdel_all_indexes(vacrel->pvs, old_live_tuples);
/*
* Do a postcheck to consider applying wraparound failsafe now. Note
@@ -2514,7 +2513,6 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
{
/* Outsource everything to parallel variant */
parallel_vacuum_cleanup_all_indexes(vacrel->pvs, reltuples,
- vacrel->num_index_scans,
estimated_count);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 0d92e694d6a..08011fde23f 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -200,6 +200,9 @@ struct ParallelVacuumState
*/
bool *will_parallel_vacuum;
+ /* How many times has index vacuuming or cleanup been executed? */
+ int num_index_scans;
+
/*
* The number of indexes that support parallel index bulk-deletion and
* parallel index cleanup respectively.
@@ -223,8 +226,7 @@ struct ParallelVacuumState
static int parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested,
bool *will_parallel_vacuum);
-static void parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scans,
- bool vacuum);
+static void parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, bool vacuum);
static void parallel_vacuum_process_safe_indexes(ParallelVacuumState *pvs);
static void parallel_vacuum_process_unsafe_indexes(ParallelVacuumState *pvs);
static void parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
@@ -497,8 +499,7 @@ parallel_vacuum_reset_dead_items(ParallelVacuumState *pvs)
* Do parallel index bulk-deletion with parallel workers.
*/
void
-parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs, long num_table_tuples,
- int num_index_scans)
+parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs, long num_table_tuples)
{
Assert(!IsParallelWorker());
@@ -509,7 +510,7 @@ parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs, long num_table_tup
pvs->shared->reltuples = num_table_tuples;
pvs->shared->estimated_count = true;
- parallel_vacuum_process_all_indexes(pvs, num_index_scans, true);
+ parallel_vacuum_process_all_indexes(pvs, true);
}
/*
@@ -517,7 +518,7 @@ parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs, long num_table_tup
*/
void
parallel_vacuum_cleanup_all_indexes(ParallelVacuumState *pvs, long num_table_tuples,
- int num_index_scans, bool estimated_count)
+ bool estimated_count)
{
Assert(!IsParallelWorker());
@@ -529,7 +530,7 @@ parallel_vacuum_cleanup_all_indexes(ParallelVacuumState *pvs, long num_table_tup
pvs->shared->reltuples = num_table_tuples;
pvs->shared->estimated_count = estimated_count;
- parallel_vacuum_process_all_indexes(pvs, num_index_scans, false);
+ parallel_vacuum_process_all_indexes(pvs, false);
}
/*
@@ -608,8 +609,7 @@ parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested,
* must be used by the parallel vacuum leader process.
*/
static void
-parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scans,
- bool vacuum)
+parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, bool vacuum)
{
int nworkers;
PVIndVacStatus new_status;
@@ -631,7 +631,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
nworkers = pvs->nindexes_parallel_cleanup;
/* Add conditionally parallel-aware indexes if in the first time call */
- if (num_index_scans == 0)
+ if (pvs->num_index_scans == 0)
nworkers += pvs->nindexes_parallel_condcleanup;
}
@@ -659,7 +659,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
indstats->parallel_workers_can_process =
(pvs->will_parallel_vacuum[i] &&
parallel_vacuum_index_is_parallel_safe(pvs->indrels[i],
- num_index_scans,
+ pvs->num_index_scans,
vacuum));
}
@@ -670,7 +670,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
if (nworkers > 0)
{
/* Reinitialize parallel context to relaunch parallel workers */
- if (num_index_scans > 0)
+ if (pvs->num_index_scans > 0)
ReinitializeParallelDSM(pvs->pcxt);
/*
@@ -764,6 +764,9 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
VacuumSharedCostBalance = NULL;
VacuumActiveNWorkers = NULL;
}
+
+ /* Increment the counter */
+ pvs->num_index_scans++;
}
/*
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 12d0b61950d..e7b7753b691 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -366,11 +366,9 @@ extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs,
VacDeadItemsInfo **dead_items_info_p);
extern void parallel_vacuum_reset_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
- long num_table_tuples,
- int num_index_scans);
+ long num_table_tuples);
extern void parallel_vacuum_cleanup_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
- int num_index_scans,
bool estimated_count);
extern void parallel_vacuum_main(dsm_segment *seg, shm_toc *toc);
--
2.43.5
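The caller-side effect is easiest to see as a sketch of the leader's flow
with the post-patch signatures (illustrative only, not actual
vacuumlazy.c code; the loop condition and variable names are
hypothetical):

/*
 * ParallelVacuumState now tracks num_index_scans internally, so the
 * leader can run any number of index-vacuum rounds without counting
 * them itself to keep ReinitializeParallelDSM() correct.
 */
while (more_heap_to_scan)		/* hypothetical loop condition */
{
	/* ... scan heap pages, accumulating dead_items ... */
	parallel_vacuum_bulkdel_all_indexes(pvs, old_live_tuples);
	/* ... second heap pass marks the LP_DEAD items LP_UNUSED ... */
}
parallel_vacuum_cleanup_all_indexes(pvs, reltuples, estimated_count);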
v6-0001-Move-lazy-heap-scanning-related-variables-to-stru.patch
From b18a0b8d0e4ccd5dda981527adecacfc14ce91c3 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 15 Nov 2024 14:14:13 -0800
Subject: [PATCH v6 1/8] Move lazy heap scanning related variables to struct
LVRelScanState.
---
src/backend/access/heap/vacuumlazy.c | 304 ++++++++++++++-------------
src/tools/pgindent/typedefs.list | 1 +
2 files changed, 159 insertions(+), 146 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 09fab08b8e1..75cd67395f4 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -131,6 +131,47 @@ typedef enum
VACUUM_ERRCB_PHASE_TRUNCATE,
} VacErrPhase;
+/*
+ * Relation statistics collected during heap scanning.
+ */
+typedef struct LVRelScanState
+{
+ BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
+ BlockNumber removed_pages; /* # pages removed by relation truncation */
+ BlockNumber new_frozen_tuple_pages; /* # pages with newly frozen tuples */
+
+ /* # pages newly set all-visible in the VM */
+ BlockNumber vm_new_visible_pages;
+
+ /*
+ * # pages newly set all-visible and all-frozen in the VM. This is a
+ * subset of vm_new_visible_pages. That is, vm_new_visible_pages includes
+ * all pages set all-visible, but vm_new_visible_frozen_pages includes
+ * only those which were also set all-frozen.
+ */
+ BlockNumber vm_new_visible_frozen_pages;
+
+ /* # all-visible pages newly set all-frozen in the VM */
+ BlockNumber vm_new_frozen_pages;
+
+ BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
+ BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
+ BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
+
+ /* Counters that follow are only for scanned_pages */
+ int64 tuples_deleted; /* # deleted from table */
+ int64 tuples_frozen; /* # newly frozen */
+ int64 lpdead_items; /* # deleted from indexes */
+ int64 live_tuples; /* # live tuples remaining */
+ int64 recently_dead_tuples; /* # dead, but not yet removable */
+ int64 missed_dead_tuples; /* # removable, but not removed */
+
+ /* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid. */
+ TransactionId NewRelfrozenXid;
+ MultiXactId NewRelminMxid;
+ bool skippedallvis;
+} LVRelScanState;
+
typedef struct LVRelState
{
/* Target heap relation and its indexes */
@@ -157,10 +198,6 @@ typedef struct LVRelState
/* VACUUM operation's cutoffs for freezing and pruning */
struct VacuumCutoffs cutoffs;
GlobalVisState *vistest;
- /* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
- TransactionId NewRelfrozenXid;
- MultiXactId NewRelminMxid;
- bool skippedallvis;
/* Error reporting state */
char *dbname;
@@ -186,43 +223,18 @@ typedef struct LVRelState
VacDeadItemsInfo *dead_items_info;
BlockNumber rel_pages; /* total number of pages */
- BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
- BlockNumber removed_pages; /* # pages removed by relation truncation */
- BlockNumber new_frozen_tuple_pages; /* # pages with newly frozen tuples */
-
- /* # pages newly set all-visible in the VM */
- BlockNumber vm_new_visible_pages;
-
- /*
- * # pages newly set all-visible and all-frozen in the VM. This is a
- * subset of vm_new_visible_pages. That is, vm_new_visible_pages includes
- * all pages set all-visible, but vm_new_visible_frozen_pages includes
- * only those which were also set all-frozen.
- */
- BlockNumber vm_new_visible_frozen_pages;
- /* # all-visible pages newly set all-frozen in the VM */
- BlockNumber vm_new_frozen_pages;
+ /* Working state for heap scanning and vacuuming */
+ LVRelScanState *scan_state;
- BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
- BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
- BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
-
- /* Statistics output by us, for table */
- double new_rel_tuples; /* new estimated total # of tuples */
- double new_live_tuples; /* new estimated total # of live tuples */
+ /* New estimated total # of tuples and total # of live tuples */
+ double new_rel_tuples;
+ double new_live_tuples;
/* Statistics output by index AMs */
IndexBulkDeleteResult **indstats;
/* Instrumentation counters */
int num_index_scans;
- /* Counters that follow are only for scanned_pages */
- int64 tuples_deleted; /* # deleted from table */
- int64 tuples_frozen; /* # newly frozen */
- int64 lpdead_items; /* # deleted from indexes */
- int64 live_tuples; /* # live tuples remaining */
- int64 recently_dead_tuples; /* # dead, but not yet removable */
- int64 missed_dead_tuples; /* # removable, but not removed */
/* State maintained by heap_vac_scan_next_block() */
BlockNumber current_block; /* last block returned */
@@ -309,6 +321,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
BufferAccessStrategy bstrategy)
{
LVRelState *vacrel;
+ LVRelScanState *scan_state;
bool verbose,
instrument,
skipwithvm,
@@ -420,12 +433,23 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
}
/* Initialize page counters explicitly (be tidy) */
- vacrel->scanned_pages = 0;
- vacrel->removed_pages = 0;
- vacrel->new_frozen_tuple_pages = 0;
- vacrel->lpdead_item_pages = 0;
- vacrel->missed_dead_pages = 0;
- vacrel->nonempty_pages = 0;
+ scan_state = palloc(sizeof(LVRelScanState));
+ scan_state->scanned_pages = 0;
+ scan_state->removed_pages = 0;
+ scan_state->new_frozen_tuple_pages = 0;
+ scan_state->lpdead_item_pages = 0;
+ scan_state->missed_dead_pages = 0;
+ scan_state->nonempty_pages = 0;
+ scan_state->tuples_deleted = 0;
+ scan_state->tuples_frozen = 0;
+ scan_state->lpdead_items = 0;
+ scan_state->live_tuples = 0;
+ scan_state->recently_dead_tuples = 0;
+ scan_state->missed_dead_tuples = 0;
+ scan_state->vm_new_visible_pages = 0;
+ scan_state->vm_new_visible_frozen_pages = 0;
+ scan_state->vm_new_frozen_pages = 0;
+ vacrel->scan_state = scan_state;
/* dead_items_alloc allocates vacrel->dead_items later on */
/* Allocate/initialize output statistics state */
@@ -434,19 +458,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->indstats = (IndexBulkDeleteResult **)
palloc0(vacrel->nindexes * sizeof(IndexBulkDeleteResult *));
- /* Initialize remaining counters (be tidy) */
- vacrel->num_index_scans = 0;
- vacrel->tuples_deleted = 0;
- vacrel->tuples_frozen = 0;
- vacrel->lpdead_items = 0;
- vacrel->live_tuples = 0;
- vacrel->recently_dead_tuples = 0;
- vacrel->missed_dead_tuples = 0;
-
- vacrel->vm_new_visible_pages = 0;
- vacrel->vm_new_visible_frozen_pages = 0;
- vacrel->vm_new_frozen_pages = 0;
-
/*
* Get cutoffs that determine which deleted tuples are considered DEAD,
* not just RECENTLY_DEAD, and which XIDs/MXIDs to freeze. Then determine
@@ -467,9 +478,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
vacrel->vistest = GlobalVisTestFor(rel);
/* Initialize state used to track oldest extant XID/MXID */
- vacrel->NewRelfrozenXid = vacrel->cutoffs.OldestXmin;
- vacrel->NewRelminMxid = vacrel->cutoffs.OldestMxact;
- vacrel->skippedallvis = false;
+ vacrel->scan_state->NewRelfrozenXid = vacrel->cutoffs.OldestXmin;
+ vacrel->scan_state->NewRelminMxid = vacrel->cutoffs.OldestMxact;
+ vacrel->scan_state->skippedallvis = false;
skipwithvm = true;
if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
{
@@ -550,15 +561,15 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* value >= FreezeLimit, and relminmxid to a value >= MultiXactCutoff.
* Non-aggressive VACUUMs may advance them by any amount, or not at all.
*/
- Assert(vacrel->NewRelfrozenXid == vacrel->cutoffs.OldestXmin ||
+ Assert(vacrel->scan_state->NewRelfrozenXid == vacrel->cutoffs.OldestXmin ||
TransactionIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.FreezeLimit :
vacrel->cutoffs.relfrozenxid,
- vacrel->NewRelfrozenXid));
- Assert(vacrel->NewRelminMxid == vacrel->cutoffs.OldestMxact ||
+ vacrel->scan_state->NewRelfrozenXid));
+ Assert(vacrel->scan_state->NewRelminMxid == vacrel->cutoffs.OldestMxact ||
MultiXactIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.MultiXactCutoff :
vacrel->cutoffs.relminmxid,
- vacrel->NewRelminMxid));
- if (vacrel->skippedallvis)
+ vacrel->scan_state->NewRelminMxid));
+ if (vacrel->scan_state->skippedallvis)
{
/*
* Must keep original relfrozenxid in a non-aggressive VACUUM that
@@ -566,8 +577,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* values will have missed unfrozen XIDs from the pages we skipped.
*/
Assert(!vacrel->aggressive);
- vacrel->NewRelfrozenXid = InvalidTransactionId;
- vacrel->NewRelminMxid = InvalidMultiXactId;
+ vacrel->scan_state->NewRelfrozenXid = InvalidTransactionId;
+ vacrel->scan_state->NewRelminMxid = InvalidMultiXactId;
}
/*
@@ -588,7 +599,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
*/
vac_update_relstats(rel, new_rel_pages, vacrel->new_live_tuples,
new_rel_allvisible, vacrel->nindexes > 0,
- vacrel->NewRelfrozenXid, vacrel->NewRelminMxid,
+ vacrel->scan_state->NewRelfrozenXid, vacrel->scan_state->NewRelminMxid,
&frozenxid_updated, &minmulti_updated, false);
/*
@@ -604,8 +615,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(rel),
rel->rd_rel->relisshared,
Max(vacrel->new_live_tuples, 0),
- vacrel->recently_dead_tuples +
- vacrel->missed_dead_tuples);
+ vacrel->scan_state->recently_dead_tuples +
+ vacrel->scan_state->missed_dead_tuples);
pgstat_progress_end_command();
if (instrument)
@@ -678,21 +689,21 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->relname,
vacrel->num_index_scans);
appendStringInfo(&buf, _("pages: %u removed, %u remain, %u scanned (%.2f%% of total)\n"),
- vacrel->removed_pages,
+ vacrel->scan_state->removed_pages,
new_rel_pages,
- vacrel->scanned_pages,
+ vacrel->scan_state->scanned_pages,
orig_rel_pages == 0 ? 100.0 :
- 100.0 * vacrel->scanned_pages / orig_rel_pages);
+ 100.0 * vacrel->scan_state->scanned_pages / orig_rel_pages);
appendStringInfo(&buf,
_("tuples: %lld removed, %lld remain, %lld are dead but not yet removable\n"),
- (long long) vacrel->tuples_deleted,
+ (long long) vacrel->scan_state->tuples_deleted,
(long long) vacrel->new_rel_tuples,
- (long long) vacrel->recently_dead_tuples);
- if (vacrel->missed_dead_tuples > 0)
+ (long long) vacrel->scan_state->recently_dead_tuples);
+ if (vacrel->scan_state->missed_dead_tuples > 0)
appendStringInfo(&buf,
_("tuples missed: %lld dead from %u pages not removed due to cleanup lock contention\n"),
- (long long) vacrel->missed_dead_tuples,
- vacrel->missed_dead_pages);
+ (long long) vacrel->scan_state->missed_dead_tuples,
+ vacrel->scan_state->missed_dead_pages);
diff = (int32) (ReadNextTransactionId() -
vacrel->cutoffs.OldestXmin);
appendStringInfo(&buf,
@@ -700,33 +711,33 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->cutoffs.OldestXmin, diff);
if (frozenxid_updated)
{
- diff = (int32) (vacrel->NewRelfrozenXid -
+ diff = (int32) (vacrel->scan_state->NewRelfrozenXid -
vacrel->cutoffs.relfrozenxid);
appendStringInfo(&buf,
_("new relfrozenxid: %u, which is %d XIDs ahead of previous value\n"),
- vacrel->NewRelfrozenXid, diff);
+ vacrel->scan_state->NewRelfrozenXid, diff);
}
if (minmulti_updated)
{
- diff = (int32) (vacrel->NewRelminMxid -
+ diff = (int32) (vacrel->scan_state->NewRelminMxid -
vacrel->cutoffs.relminmxid);
appendStringInfo(&buf,
_("new relminmxid: %u, which is %d MXIDs ahead of previous value\n"),
- vacrel->NewRelminMxid, diff);
+ vacrel->scan_state->NewRelminMxid, diff);
}
appendStringInfo(&buf, _("frozen: %u pages from table (%.2f%% of total) had %lld tuples frozen\n"),
- vacrel->new_frozen_tuple_pages,
+ vacrel->scan_state->new_frozen_tuple_pages,
orig_rel_pages == 0 ? 100.0 :
- 100.0 * vacrel->new_frozen_tuple_pages /
+ 100.0 * vacrel->scan_state->new_frozen_tuple_pages /
orig_rel_pages,
- (long long) vacrel->tuples_frozen);
+ (long long) vacrel->scan_state->tuples_frozen);
appendStringInfo(&buf,
_("visibility map: %u pages set all-visible, %u pages set all-frozen (%u were all-visible)\n"),
- vacrel->vm_new_visible_pages,
- vacrel->vm_new_visible_frozen_pages +
- vacrel->vm_new_frozen_pages,
- vacrel->vm_new_frozen_pages);
+ vacrel->scan_state->vm_new_visible_pages,
+ vacrel->scan_state->vm_new_visible_frozen_pages +
+ vacrel->scan_state->vm_new_frozen_pages,
+ vacrel->scan_state->vm_new_frozen_pages);
if (vacrel->do_index_vacuuming)
{
if (vacrel->nindexes == 0 || vacrel->num_index_scans == 0)
@@ -746,10 +757,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
msgfmt = _("%u pages from table (%.2f%% of total) have %lld dead item identifiers\n");
}
appendStringInfo(&buf, msgfmt,
- vacrel->lpdead_item_pages,
+ vacrel->scan_state->lpdead_item_pages,
orig_rel_pages == 0 ? 100.0 :
- 100.0 * vacrel->lpdead_item_pages / orig_rel_pages,
- (long long) vacrel->lpdead_items);
+ 100.0 * vacrel->scan_state->lpdead_item_pages / orig_rel_pages,
+ (long long) vacrel->scan_state->lpdead_items);
for (int i = 0; i < vacrel->nindexes; i++)
{
IndexBulkDeleteResult *istat = vacrel->indstats[i];
@@ -882,7 +893,7 @@ lazy_scan_heap(LVRelState *vacrel)
bool has_lpdead_items;
bool got_cleanup_lock = false;
- vacrel->scanned_pages++;
+ vacrel->scan_state->scanned_pages++;
/* Report as block scanned, update error traceback information */
pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
@@ -900,7 +911,7 @@ lazy_scan_heap(LVRelState *vacrel)
* one-pass strategy, and the two-pass strategy with the index_cleanup
* param set to 'off'.
*/
- if (vacrel->scanned_pages % FAILSAFE_EVERY_PAGES == 0)
+ if (vacrel->scan_state->scanned_pages % FAILSAFE_EVERY_PAGES == 0)
lazy_check_wraparound_failsafe(vacrel);
/*
@@ -1064,16 +1075,16 @@ lazy_scan_heap(LVRelState *vacrel)
/* now we can compute the new value for pg_class.reltuples */
vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
- vacrel->scanned_pages,
- vacrel->live_tuples);
+ vacrel->scan_state->scanned_pages,
+ vacrel->scan_state->live_tuples);
/*
* Also compute the total number of surviving heap entries. In the
* (unlikely) scenario that new_live_tuples is -1, take it as zero.
*/
vacrel->new_rel_tuples =
- Max(vacrel->new_live_tuples, 0) + vacrel->recently_dead_tuples +
- vacrel->missed_dead_tuples;
+ Max(vacrel->new_live_tuples, 0) + vacrel->scan_state->recently_dead_tuples +
+ vacrel->scan_state->missed_dead_tuples;
/*
* Do index vacuuming (call each index's ambulkdelete routine), then do
@@ -1110,10 +1121,10 @@ lazy_scan_heap(LVRelState *vacrel)
* there are no further blocks to process.
*
* vacrel is an in/out parameter here. Vacuum options and information about
- * the relation are read. vacrel->skippedallvis is set if we skip a block
- * that's all-visible but not all-frozen, to ensure that we don't update
- * relfrozenxid in that case. vacrel also holds information about the next
- * unskippable block, as bookkeeping for this function.
+ * the relation are read. vacrel->scan_state->skippedallvis is set if we skip
+ * a block that's all-visible but not all-frozen, to ensure that we don't
+ * update relfrozenxid in that case. vacrel also holds information about the
+ * next unskippable block, as bookkeeping for this function.
*/
static bool
heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
@@ -1170,7 +1181,7 @@ heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
{
next_block = vacrel->next_unskippable_block;
if (skipsallvis)
- vacrel->skippedallvis = true;
+ vacrel->scan_state->skippedallvis = true;
}
}
@@ -1414,11 +1425,11 @@ lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
*/
if ((old_vmbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
{
- vacrel->vm_new_visible_pages++;
- vacrel->vm_new_visible_frozen_pages++;
+ vacrel->scan_state->vm_new_visible_pages++;
+ vacrel->scan_state->vm_new_visible_frozen_pages++;
}
else if ((old_vmbits & VISIBILITYMAP_ALL_FROZEN) == 0)
- vacrel->vm_new_frozen_pages++;
+ vacrel->scan_state->vm_new_frozen_pages++;
}
freespace = PageGetHeapFreeSpace(page);
@@ -1488,10 +1499,11 @@ lazy_scan_prune(LVRelState *vacrel,
heap_page_prune_and_freeze(rel, buf, vacrel->vistest, prune_options,
&vacrel->cutoffs, &presult, PRUNE_VACUUM_SCAN,
&vacrel->offnum,
- &vacrel->NewRelfrozenXid, &vacrel->NewRelminMxid);
+ &vacrel->scan_state->NewRelfrozenXid,
+ &vacrel->scan_state->NewRelminMxid);
- Assert(MultiXactIdIsValid(vacrel->NewRelminMxid));
- Assert(TransactionIdIsValid(vacrel->NewRelfrozenXid));
+ Assert(MultiXactIdIsValid(vacrel->scan_state->NewRelminMxid));
+ Assert(TransactionIdIsValid(vacrel->scan_state->NewRelfrozenXid));
if (presult.nfrozen > 0)
{
@@ -1501,7 +1513,7 @@ lazy_scan_prune(LVRelState *vacrel,
* frozen tuples (don't confuse that with pages newly set all-frozen
* in VM).
*/
- vacrel->new_frozen_tuple_pages++;
+ vacrel->scan_state->new_frozen_tuple_pages++;
}
/*
@@ -1536,7 +1548,7 @@ lazy_scan_prune(LVRelState *vacrel,
*/
if (presult.lpdead_items > 0)
{
- vacrel->lpdead_item_pages++;
+ vacrel->scan_state->lpdead_item_pages++;
/*
* deadoffsets are collected incrementally in
@@ -1551,15 +1563,15 @@ lazy_scan_prune(LVRelState *vacrel,
}
/* Finally, add page-local counts to whole-VACUUM counts */
- vacrel->tuples_deleted += presult.ndeleted;
- vacrel->tuples_frozen += presult.nfrozen;
- vacrel->lpdead_items += presult.lpdead_items;
- vacrel->live_tuples += presult.live_tuples;
- vacrel->recently_dead_tuples += presult.recently_dead_tuples;
+ vacrel->scan_state->tuples_deleted += presult.ndeleted;
+ vacrel->scan_state->tuples_frozen += presult.nfrozen;
+ vacrel->scan_state->lpdead_items += presult.lpdead_items;
+ vacrel->scan_state->live_tuples += presult.live_tuples;
+ vacrel->scan_state->recently_dead_tuples += presult.recently_dead_tuples;
/* Can't truncate this page */
if (presult.hastup)
- vacrel->nonempty_pages = blkno + 1;
+ vacrel->scan_state->nonempty_pages = blkno + 1;
/* Did we find LP_DEAD items? */
*has_lpdead_items = (presult.lpdead_items > 0);
@@ -1608,13 +1620,13 @@ lazy_scan_prune(LVRelState *vacrel,
*/
if ((old_vmbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
{
- vacrel->vm_new_visible_pages++;
+ vacrel->scan_state->vm_new_visible_pages++;
if (presult.all_frozen)
- vacrel->vm_new_visible_frozen_pages++;
+ vacrel->scan_state->vm_new_visible_frozen_pages++;
}
else if ((old_vmbits & VISIBILITYMAP_ALL_FROZEN) == 0 &&
presult.all_frozen)
- vacrel->vm_new_frozen_pages++;
+ vacrel->scan_state->vm_new_frozen_pages++;
}
/*
@@ -1700,8 +1712,8 @@ lazy_scan_prune(LVRelState *vacrel,
*/
if ((old_vmbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
{
- vacrel->vm_new_visible_pages++;
- vacrel->vm_new_visible_frozen_pages++;
+ vacrel->scan_state->vm_new_visible_pages++;
+ vacrel->scan_state->vm_new_visible_frozen_pages++;
}
/*
@@ -1709,7 +1721,7 @@ lazy_scan_prune(LVRelState *vacrel,
* above, so we don't need to test the value of old_vmbits.
*/
else
- vacrel->vm_new_frozen_pages++;
+ vacrel->scan_state->vm_new_frozen_pages++;
}
}
@@ -1748,8 +1760,8 @@ lazy_scan_noprune(LVRelState *vacrel,
missed_dead_tuples;
bool hastup;
HeapTupleHeader tupleheader;
- TransactionId NoFreezePageRelfrozenXid = vacrel->NewRelfrozenXid;
- MultiXactId NoFreezePageRelminMxid = vacrel->NewRelminMxid;
+ TransactionId NoFreezePageRelfrozenXid = vacrel->scan_state->NewRelfrozenXid;
+ MultiXactId NoFreezePageRelminMxid = vacrel->scan_state->NewRelminMxid;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1876,8 +1888,8 @@ lazy_scan_noprune(LVRelState *vacrel,
* this particular page until the next VACUUM. Remember its details now.
* (lazy_scan_prune expects a clean slate, so we have to do this last.)
*/
- vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
- vacrel->NewRelminMxid = NoFreezePageRelminMxid;
+ vacrel->scan_state->NewRelfrozenXid = NoFreezePageRelfrozenXid;
+ vacrel->scan_state->NewRelminMxid = NoFreezePageRelminMxid;
/* Save any LP_DEAD items found on the page in dead_items */
if (vacrel->nindexes == 0)
@@ -1904,25 +1916,25 @@ lazy_scan_noprune(LVRelState *vacrel,
* indexes will be deleted during index vacuuming (and then marked
* LP_UNUSED in the heap)
*/
- vacrel->lpdead_item_pages++;
+ vacrel->scan_state->lpdead_item_pages++;
dead_items_add(vacrel, blkno, deadoffsets, lpdead_items);
- vacrel->lpdead_items += lpdead_items;
+ vacrel->scan_state->lpdead_items += lpdead_items;
}
/*
* Finally, add relevant page-local counts to whole-VACUUM counts
*/
- vacrel->live_tuples += live_tuples;
- vacrel->recently_dead_tuples += recently_dead_tuples;
- vacrel->missed_dead_tuples += missed_dead_tuples;
+ vacrel->scan_state->live_tuples += live_tuples;
+ vacrel->scan_state->recently_dead_tuples += recently_dead_tuples;
+ vacrel->scan_state->missed_dead_tuples += missed_dead_tuples;
if (missed_dead_tuples > 0)
- vacrel->missed_dead_pages++;
+ vacrel->scan_state->missed_dead_pages++;
/* Can't truncate this page */
if (hastup)
- vacrel->nonempty_pages = blkno + 1;
+ vacrel->scan_state->nonempty_pages = blkno + 1;
/* Did we find LP_DEAD items? */
*has_lpdead_items = (lpdead_items > 0);
@@ -1951,7 +1963,7 @@ lazy_vacuum(LVRelState *vacrel)
/* Should not end up here with no indexes */
Assert(vacrel->nindexes > 0);
- Assert(vacrel->lpdead_item_pages > 0);
+ Assert(vacrel->scan_state->lpdead_item_pages > 0);
if (!vacrel->do_index_vacuuming)
{
@@ -1985,7 +1997,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items_info->num_items);
+ Assert(vacrel->scan_state->lpdead_items == vacrel->dead_items_info->num_items);
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2012,7 +2024,7 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
+ bypass = (vacrel->scan_state->lpdead_item_pages < threshold &&
(TidStoreMemoryUsage(vacrel->dead_items) < (32L * 1024L * 1024L)));
}
@@ -2150,7 +2162,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items_info->num_items == vacrel->lpdead_items);
+ vacrel->dead_items_info->num_items == vacrel->scan_state->lpdead_items);
Assert(allindexes || VacuumFailsafeActive);
/*
@@ -2259,8 +2271,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* the second heap pass. No more, no less.
*/
Assert(vacrel->num_index_scans > 1 ||
- (vacrel->dead_items_info->num_items == vacrel->lpdead_items &&
- vacuumed_pages == vacrel->lpdead_item_pages));
+ (vacrel->dead_items_info->num_items == vacrel->scan_state->lpdead_items &&
+ vacuumed_pages == vacrel->scan_state->lpdead_item_pages));
ereport(DEBUG2,
(errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
@@ -2376,14 +2388,14 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
*/
if ((old_vmbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
{
- vacrel->vm_new_visible_pages++;
+ vacrel->scan_state->vm_new_visible_pages++;
if (all_frozen)
- vacrel->vm_new_visible_frozen_pages++;
+ vacrel->scan_state->vm_new_visible_frozen_pages++;
}
else if ((old_vmbits & VISIBILITYMAP_ALL_FROZEN) == 0 &&
all_frozen)
- vacrel->vm_new_frozen_pages++;
+ vacrel->scan_state->vm_new_frozen_pages++;
}
/* Revert to the previous phase information for error traceback */
@@ -2459,7 +2471,7 @@ static void
lazy_cleanup_all_indexes(LVRelState *vacrel)
{
double reltuples = vacrel->new_rel_tuples;
- bool estimated_count = vacrel->scanned_pages < vacrel->rel_pages;
+ bool estimated_count = vacrel->scan_state->scanned_pages < vacrel->rel_pages;
const int progress_start_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_INDEXES_TOTAL
@@ -2640,7 +2652,7 @@ should_attempt_truncation(LVRelState *vacrel)
if (!vacrel->do_rel_truncate || VacuumFailsafeActive)
return false;
- possibly_freeable = vacrel->rel_pages - vacrel->nonempty_pages;
+ possibly_freeable = vacrel->rel_pages - vacrel->scan_state->nonempty_pages;
if (possibly_freeable > 0 &&
(possibly_freeable >= REL_TRUNCATE_MINIMUM ||
possibly_freeable >= vacrel->rel_pages / REL_TRUNCATE_FRACTION))
@@ -2666,7 +2678,7 @@ lazy_truncate_heap(LVRelState *vacrel)
/* Update error traceback information one last time */
update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_TRUNCATE,
- vacrel->nonempty_pages, InvalidOffsetNumber);
+ vacrel->scan_state->nonempty_pages, InvalidOffsetNumber);
/*
* Loop until no more truncating can be done.
@@ -2767,7 +2779,7 @@ lazy_truncate_heap(LVRelState *vacrel)
* without also touching reltuples, since the tuple count wasn't
* changed by the truncation.
*/
- vacrel->removed_pages += orig_rel_pages - new_rel_pages;
+ vacrel->scan_state->removed_pages += orig_rel_pages - new_rel_pages;
vacrel->rel_pages = new_rel_pages;
ereport(vacrel->verbose ? INFO : DEBUG2,
@@ -2775,7 +2787,7 @@ lazy_truncate_heap(LVRelState *vacrel)
vacrel->relname,
orig_rel_pages, new_rel_pages)));
orig_rel_pages = new_rel_pages;
- } while (new_rel_pages > vacrel->nonempty_pages && lock_waiter_detected);
+ } while (new_rel_pages > vacrel->scan_state->nonempty_pages && lock_waiter_detected);
}
/*
@@ -2803,7 +2815,7 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
StaticAssertStmt((PREFETCH_SIZE & (PREFETCH_SIZE - 1)) == 0,
"prefetch size must be power of 2");
prefetchedUntil = InvalidBlockNumber;
- while (blkno > vacrel->nonempty_pages)
+ while (blkno > vacrel->scan_state->nonempty_pages)
{
Buffer buf;
Page page;
@@ -2915,7 +2927,7 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
* pages still are; we need not bother to look at the last known-nonempty
* page.
*/
- return vacrel->nonempty_pages;
+ return vacrel->scan_state->nonempty_pages;
}
/*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e1c4f913f84..80202d4a824 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1479,6 +1479,7 @@ LPVOID
LPWSTR
LSEG
LUID
+LVRelScanState
LVRelState
LVSavedErrInfo
LWLock
--
2.43.5
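Grouping these counters into one separately allocated struct also makes it
straightforward to keep one instance per process and combine them
afterwards; a hypothetical merge helper (not in this patch, field names
taken from the struct it adds) could look like:

/*
 * Hypothetical, for illustration only: fold one LVRelScanState into
 * another, e.g. a worker's counters into the leader's.
 */
static void
lv_scan_state_merge(LVRelScanState *dst, const LVRelScanState *src)
{
	dst->scanned_pages += src->scanned_pages;
	dst->lpdead_item_pages += src->lpdead_item_pages;
	dst->missed_dead_pages += src->missed_dead_pages;
	dst->tuples_deleted += src->tuples_deleted;
	dst->tuples_frozen += src->tuples_frozen;
	dst->lpdead_items += src->lpdead_items;
	dst->live_tuples += src->live_tuples;
	dst->recently_dead_tuples += src->recently_dead_tuples;
	dst->missed_dead_tuples += src->missed_dead_tuples;

	/* keep the oldest extant XID/MXID seen by any process */
	if (TransactionIdPrecedes(src->NewRelfrozenXid, dst->NewRelfrozenXid))
		dst->NewRelfrozenXid = src->NewRelfrozenXid;
	if (MultiXactIdPrecedes(src->NewRelminMxid, dst->NewRelminMxid))
		dst->NewRelminMxid = src->NewRelminMxid;
	dst->skippedallvis |= src->skippedallvis;

	/* nonempty_pages is "last nonempty page + 1", so take the maximum */
	dst->nonempty_pages = Max(dst->nonempty_pages, src->nonempty_pages);
}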
On Fri, Jan 3, 2025 at 3:38 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Dec 25, 2024 at 8:52 AM Tomas Vondra <tomas@vondra.me> wrote:
On 12/19/24 23:05, Masahiko Sawada wrote:
On Sat, Dec 14, 2024 at 1:24 PM Tomas Vondra <tomas@vondra.me> wrote:
On 12/13/24 00:04, Tomas Vondra wrote:
...
The main difference is here:
master / no parallel workers:
pages: 0 removed, 221239 remain, 221239 scanned (100.00% of total)
1 parallel worker:
pages: 0 removed, 221239 remain, 10001 scanned (4.52% of total)
Clearly, with parallel vacuum we scan only a tiny fraction of the pages,
essentially just those with deleted tuples, which is ~1/20 of pages.
That's close to the 15x speedup.

This effect is clearest without indexes, but it does affect even runs
with indexes - having to scan the indexes makes it much less pronounced,
though. However, these indexes are pretty massive (about the same size
as the table) - multiple times larger than the table. Chances are it'd
be clearer on realistic data sets.

So the question is - is this correct? And if yes, why doesn't the
regular (serial) vacuum do that?

There are some more strange things, though. For example, how come the avg
read rate is 0.000 MB/s?

avg read rate: 0.000 MB/s, avg write rate: 525.533 MB/s
It scanned 10k pages, i.e. ~80MB of data in 0.15 seconds. Surely that's
not 0.000 MB/s? I guess it's calculated from buffer misses, and all the
pages are in shared buffers (thanks to the DELETE earlier in that session).

OK, after looking into this a bit more I think the reason is rather
simple - SKIP_PAGES_THRESHOLD.

With serial runs, we end up scanning all pages, because even with an
update every 5000 tuples, that's still only ~25 pages apart, well within
the 32-page window. So we end up skipping no pages, and scan and vacuum
everything.

But parallel runs have this skipping logic disabled, or rather the logic
that switches to sequential scans if the gap is less than 32 pages.

IMHO this raises two questions:
1) Shouldn't parallel runs use SKIP_PAGES_THRESHOLD too, i.e. switch to
sequential scans if the pages are close enough? Maybe there is a reason
for this difference? Workers can reduce the difference between random
and sequential I/O, similarly to prefetching. But that just means the
workers should use a lower threshold, e.g. as

SKIP_PAGES_THRESHOLD / nworkers

or something like that? I don't see this discussed in this thread.
Each parallel heap scan worker allocates a chunk of blocks, which is
8192 blocks at maximum, so we would need to use the SKIP_PAGES_THRESHOLD
optimization within the chunk. I agree that we need to evaluate the
differences anyway. Will do the benchmark test and share the results.

Right. I don't think this really matters for small tables, and for large
tables the chunks should be fairly large (possibly up to 8192 blocks),
in which case we could apply SKIP_PAGES_THRESHOLD just like in the serial
case. There might be differences at boundaries between chunks, but that
seems like a minor / expected detail. I haven't checked if the code
would need to change / how much.
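To make the within-chunk idea concrete, the skip decision could look
roughly like this (a sketch under the assumptions stated in the comments,
not the patch's actual code; blkno, next_unskippable and chunk_end are
hypothetical parameters):

/*
 * Sketch: decide whether a worker may skip ahead. Assumes the worker
 * scans a contiguous chunk ending at chunk_end (exclusive), that
 * next_unskippable is the next block the visibility map forces us to
 * read, and that nworkers >= 1.
 */
static bool
worker_should_skip_range(BlockNumber blkno, BlockNumber next_unskippable,
						 BlockNumber chunk_end, int nworkers)
{
	/* one idea from this thread: scale the threshold by worker count */
	BlockNumber	threshold = SKIP_PAGES_THRESHOLD / nworkers;

	/* never let a skip decision cross the worker's chunk boundary */
	BlockNumber	skippable = Min(next_unskippable, chunk_end) - blkno;

	/* skip only if the run of skippable pages is long enough */
	return skippable >= threshold;
}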
2) It seems the current SKIP_PAGES_THRESHOLD is awfully high for good
storage. If I can get an order of magnitude improvement (or more than
that) by disabling the threshold, and just doing random I/O, maybe it's
time to adjust it a bit.

Yeah, you've started a thread for this so let's discuss it there.
OK. FWIW as suggested in the other thread, it doesn't seem to be merely
a question of VACUUM performance, as not skipping pages gives vacuum the
opportunity to do cleanup that would otherwise need to happen later.

If only for this reason, I think it would be good to keep the serial and
parallel vacuum consistent.

I've not evaluated the SKIP_PAGES_THRESHOLD optimization yet, but I'd like
to share the latest patch set as cfbot reports some failures. Comments
from Kuroda-san are also incorporated in this version. Also, I'd like
to share the performance test results I did with the latest patch.
I've implemented the SKIP_PAGES_THRESHOLD optimization in parallel heap
scan and attached the updated patch set. I've also attached the
performance test results comparing the v6 and v7 patch sets. I don't
see big differences across the test cases, but the v7 patch performs
slightly better.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
parallel_heap_vacuum_v6_v7.pdf (performance test results comparing the v6 and v7 patch sets)