Support tid range scan in parallel?
Hello
When using ctid as a
restriction clause with lower and upper bounds, PostgreSQL's planner will use
a TID range scan plan to handle such a query. This works and is generally fine.
However, if the ctid range covers a huge amount of data, the planner will not
use parallel workers to perform the ctid range scan because that is not supported.
It could, however, still choose a parallel sequential scan to complete the scan if it costs less.
In one of our
migration scenarios, we rely on tid range scan to migrate a huge table from one
database to another once the lower and upper ctid bounds are determined. With
support for parallel ctid range scan, this process could be done much quicker.
The attached patch
is my approach to adding parallel ctid range scan to PostgreSQL's planner and executor. In my
tests, I do see an increase in performance using parallel tid range scan over
the single-worker tid range scan, and it is also faster than a parallel sequential
scan covering similar ranges. Of course, the table needs to be large enough to
reflect the performance increase.
Below are the timings to complete a select query covering all the records in a simple 2-column
table with 40 million records:
- tid range scan takes 10216ms
- tid range scan with 2 workers takes 7109ms
- sequential scan with 2 workers takes 8499ms
Having support
for parallel ctid range scan is definitely helpful in our migration case, and I am
sure it could be useful in other cases as well. I am sharing the patch here;
if someone could provide quick feedback or a review, that would be greatly appreciated.
Thank you!
Cary Huang
-------------
HighGo Software Inc. (Canada)
cary.huang@highgo.ca
Attachments:
v1-0001-add-parallel-tid-rangescan.patch (application/octet-stream)
From 2093a78191cacaeecae6b1bd095433dbdbf1eb3d Mon Sep 17 00:00:00 2001
From: Cary Huang <cary.huang@highgo.ca>
Date: Mon, 29 Apr 2024 14:48:44 -0700
Subject: [PATCH] added parallel tid range scan feature
---
src/backend/access/heap/heapam.c | 19 +++--
src/backend/access/table/tableam.c | 29 +++++++
src/backend/executor/execParallel.c | 20 +++++
src/backend/executor/nodeTidrangescan.c | 81 +++++++++++++++++++
src/backend/optimizer/path/allpaths.c | 5 ++
src/backend/optimizer/path/costsize.c | 22 +++++
src/backend/optimizer/path/tidpath.c | 31 ++++++-
src/backend/optimizer/util/pathnode.c | 7 +-
src/include/access/tableam.h | 12 +++
src/include/executor/nodeTidrangescan.h | 7 ++
src/include/nodes/execnodes.h | 1 +
src/include/optimizer/pathnode.h | 3 +-
src/include/optimizer/paths.h | 2 +
src/test/regress/expected/select_parallel.out | 51 ++++++++++++
src/test/regress/sql/select_parallel.sql | 15 ++++
15 files changed, 293 insertions(+), 12 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 4be0dee4de..2cbb058e96 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1367,7 +1367,8 @@ heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
* Check for an empty range and protect from would be negative results
* from the numBlks calculation below.
*/
- if (ItemPointerCompare(&highestItem, &lowestItem) < 0)
+ if (ItemPointerCompare(&highestItem, &lowestItem) < 0 &&
+ sscan->rs_parallel == NULL)
{
/* Set an empty range of blocks to scan */
heap_setscanlimits(sscan, 0, 0);
@@ -1381,15 +1382,19 @@ heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
* lowestItem has an offset above MaxOffsetNumber. In this case, we could
* advance startBlk by one. Likewise, if highestItem has an offset of 0
* we could scan one fewer blocks. However, such an optimization does not
- * seem worth troubling over, currently.
+ * seem worth troubling over, currently. This is set only in non-parallel
+ * case.
*/
- startBlk = ItemPointerGetBlockNumberNoCheck(&lowestItem);
+ if (sscan->rs_parallel == NULL)
+ {
+ startBlk = ItemPointerGetBlockNumberNoCheck(&lowestItem);
- numBlks = ItemPointerGetBlockNumberNoCheck(&highestItem) -
- ItemPointerGetBlockNumberNoCheck(&lowestItem) + 1;
+ numBlks = ItemPointerGetBlockNumberNoCheck(&highestItem) -
+ ItemPointerGetBlockNumberNoCheck(&lowestItem) + 1;
- /* Set the start block and number of blocks to scan */
- heap_setscanlimits(sscan, startBlk, numBlks);
+ /* Set the start block and number of blocks to scan */
+ heap_setscanlimits(sscan, startBlk, numBlks);
+ }
/* Finally, set the TID range in sscan */
ItemPointerCopy(&lowestItem, &sscan->rs_mintid);
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index e57a0b7ea3..38253643e8 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -187,6 +187,35 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
pscan, flags);
}
+TableScanDesc
+table_beginscan_parallel_tidrange(Relation relation, ParallelTableScanDesc pscan,
+ ItemPointer mintid, ItemPointer maxtid)
+{
+ Snapshot snapshot;
+ uint32 flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
+ TableScanDesc sscan;
+
+ Assert(RelationGetRelid(relation) == pscan->phs_relid);
+
+ if (!pscan->phs_snapshot_any)
+ {
+ /* Snapshot was serialized -- restore it */
+ snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
+ RegisterSnapshot(snapshot);
+ flags |= SO_TEMP_SNAPSHOT;
+ }
+ else
+ {
+ /* SnapshotAny passed by caller (not serialized) */
+ snapshot = SnapshotAny;
+ }
+
+ sscan = relation->rd_tableam->scan_begin(relation, snapshot, 0, NULL,
+ pscan, flags);
+
+ return sscan;
+}
+
/* ----------------------------------------------------------------------------
* Index scan related functions.
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 8c53d1834e..e4733ca5a3 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -40,6 +40,7 @@
#include "executor/nodeSort.h"
#include "executor/nodeSubplan.h"
#include "executor/tqueue.h"
+#include "executor/nodeTidrangescan.h"
#include "jit/jit.h"
#include "nodes/nodeFuncs.h"
#include "pgstat.h"
@@ -296,6 +297,11 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecMemoizeEstimate((MemoizeState *) planstate, e->pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanEstimate((TidRangeScanState *) planstate,
+ e->pcxt);
+ break;
default:
break;
}
@@ -520,6 +526,11 @@ ExecParallelInitializeDSM(PlanState *planstate,
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecMemoizeInitializeDSM((MemoizeState *) planstate, d->pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanInitializeDSM((TidRangeScanState *) planstate,
+ d->pcxt);
+ break;
default:
break;
}
@@ -1006,6 +1017,11 @@ ExecParallelReInitializeDSM(PlanState *planstate,
case T_MemoizeState:
/* these nodes have DSM state, but no reinitialization is required */
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanReInitializeDSM((TidRangeScanState *) planstate,
+ pcxt);
+ break;
default:
break;
@@ -1372,6 +1388,10 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecMemoizeInitializeWorker((MemoizeState *) planstate, pwcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanInitializeWorker((TidRangeScanState *) planstate, pwcxt);
+ break;
default:
break;
}
diff --git a/src/backend/executor/nodeTidrangescan.c b/src/backend/executor/nodeTidrangescan.c
index 9aa7683d7e..a353a17731 100644
--- a/src/backend/executor/nodeTidrangescan.c
+++ b/src/backend/executor/nodeTidrangescan.c
@@ -403,3 +403,84 @@ ExecInitTidRangeScan(TidRangeScan *node, EState *estate, int eflags)
*/
return tidrangestate;
}
+/* ----------------------------------------------------------------
+ * Parallel Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanEstimate
+ *
+ * Compute the amount of space we'll need in the parallel
+ * query DSM, and inform pcxt->estimator about our needs.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanEstimate(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+
+ node->pscan_len = table_parallelscan_estimate(node->ss.ss_currentRelation,
+ estate->es_snapshot);
+ shm_toc_estimate_chunk(&pcxt->estimator, node->pscan_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanInitializeDSM
+ *
+ * Set up a parallel heap scan descriptor.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeDSM(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+ ParallelTableScanDesc pscan;
+
+ pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
+ table_parallelscan_initialize(node->ss.ss_currentRelation,
+ pscan,
+ estate->es_snapshot);
+ shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan,
+ &node->trss_mintid, &node->trss_maxtid);
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanReInitializeDSM
+ *
+ * Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanReInitializeDSM(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ ParallelTableScanDesc pscan;
+
+ pscan = node->ss.ss_currentScanDesc->rs_parallel;
+ table_parallelscan_reinitialize(node->ss.ss_currentRelation, pscan);
+
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanInitializeWorker
+ *
+ * Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeWorker(TidRangeScanState *node,
+ ParallelWorkerContext *pwcxt)
+{
+ ParallelTableScanDesc pscan;
+
+ pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan,
+ &node->trss_mintid, &node->trss_maxtid);
+}
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index cc51ae1757..ab0d1d5fb7 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -794,6 +794,7 @@ static void
create_plain_partial_paths(PlannerInfo *root, RelOptInfo *rel)
{
int parallel_workers;
+ Path *path = NULL;
parallel_workers = compute_parallel_worker(rel, rel->pages, -1,
max_parallel_workers_per_gather);
@@ -804,6 +805,10 @@ create_plain_partial_paths(PlannerInfo *root, RelOptInfo *rel)
/* Add an unordered partial path based on a parallel sequential scan. */
add_partial_path(rel, create_seqscan_path(root, rel, NULL, parallel_workers));
+
+ path = create_tidrangescan_subpaths(root, rel, parallel_workers);
+ if (path)
+ add_partial_path(rel, path);
}
/*
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index ee23ed7835..262b1ef02d 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1435,6 +1435,28 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
startup_cost += path->pathtarget->cost.startup;
run_cost += path->pathtarget->cost.per_tuple * path->rows;
+ /* Adjust costing for parallelism, if used. */
+ if (path->parallel_workers > 0)
+ {
+ double parallel_divisor = get_parallel_divisor(path);
+
+ /* The CPU cost is divided among all the workers. */
+ run_cost /= parallel_divisor;
+
+ /*
+ * It may be possible to amortize some of the I/O cost, but probably
+ * not very much, because most operating systems already do aggressive
+ * prefetching. For now, we assume that the disk run cost can't be
+ * amortized at all.
+ */
+
+ /*
+ * In the case of a parallel plan, the row count needs to represent
+ * the number of tuples processed per worker.
+ */
+ path->rows = clamp_row_est(path->rows / parallel_divisor);
+ }
+
path->startup_cost = startup_cost;
path->total_cost = startup_cost + run_cost;
}
diff --git a/src/backend/optimizer/path/tidpath.c b/src/backend/optimizer/path/tidpath.c
index 2ae5ddfe43..c5413be9e6 100644
--- a/src/backend/optimizer/path/tidpath.c
+++ b/src/backend/optimizer/path/tidpath.c
@@ -496,7 +496,8 @@ create_tidscan_paths(PlannerInfo *root, RelOptInfo *rel)
add_path(rel, (Path *) create_tidrangescan_path(root, rel,
tidrangequals,
- required_outer));
+ required_outer,
+ 0));
}
/*
@@ -526,3 +527,31 @@ create_tidscan_paths(PlannerInfo *root, RelOptInfo *rel)
*/
BuildParameterizedTidPaths(root, rel, rel->joininfo);
}
+
+Path *
+create_tidrangescan_subpaths(PlannerInfo *root, RelOptInfo *rel, int parallel_workers)
+{
+ List *tidrangequals;
+ Path *path;
+ /*
+ * If there are range quals in the baserestrict list, generate a
+ * TidRangePath.
+ */
+ tidrangequals = TidRangeQualFromRestrictInfoList(rel->baserestrictinfo,
+ rel);
+
+ if (tidrangequals != NIL)
+ {
+ /*
+ * This path uses no join clauses, but it could still have required
+ * parameterization due to LATERAL refs in its tlist.
+ */
+ Relids required_outer = rel->lateral_relids;
+ path = (Path *) create_tidrangescan_path(root, rel,
+ tidrangequals,
+ required_outer,
+ parallel_workers);
+ return path;
+ }
+ return NULL;
+}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 3cf1dac087..7ceeaf8688 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1206,7 +1206,8 @@ create_tidscan_path(PlannerInfo *root, RelOptInfo *rel, List *tidquals,
*/
TidRangePath *
create_tidrangescan_path(PlannerInfo *root, RelOptInfo *rel,
- List *tidrangequals, Relids required_outer)
+ List *tidrangequals, Relids required_outer,
+ int parallel_workers)
{
TidRangePath *pathnode = makeNode(TidRangePath);
@@ -1215,9 +1216,9 @@ create_tidrangescan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->path.pathtarget = rel->reltarget;
pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
required_outer);
- pathnode->path.parallel_aware = false;
+ pathnode->path.parallel_aware = (parallel_workers > 0);
pathnode->path.parallel_safe = rel->consider_parallel;
- pathnode->path.parallel_workers = 0;
+ pathnode->path.parallel_workers = parallel_workers;
pathnode->path.pathkeys = NIL; /* always unordered */
pathnode->tidrangequals = tidrangequals;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8e583b45cd..14c4a694a1 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1175,6 +1175,18 @@ extern void table_parallelscan_initialize(Relation rel,
extern TableScanDesc table_beginscan_parallel(Relation relation,
ParallelTableScanDesc pscan);
+/*
+ * Begin a parallel tidrange scan. `pscan` needs to have been initialized with
+ * table_parallelscan_initialize(), for the same relation. The initialization
+ * does not need to have happened in this backend.
+ *
+ * Caller must hold a suitable lock on the relation.
+ */
+extern TableScanDesc table_beginscan_parallel_tidrange(Relation relation,
+ ParallelTableScanDesc pscan,
+ ItemPointer mintid,
+ ItemPointer maxtid);
+
/*
* Restart a parallel scan. Call this in the leader process. Caller is
* responsible for making sure that all workers have finished the scan
diff --git a/src/include/executor/nodeTidrangescan.h b/src/include/executor/nodeTidrangescan.h
index 1cfc7a07be..977cb8eb6e 100644
--- a/src/include/executor/nodeTidrangescan.h
+++ b/src/include/executor/nodeTidrangescan.h
@@ -14,6 +14,7 @@
#ifndef NODETIDRANGESCAN_H
#define NODETIDRANGESCAN_H
+#include "access/parallel.h"
#include "nodes/execnodes.h"
extern TidRangeScanState *ExecInitTidRangeScan(TidRangeScan *node,
@@ -21,4 +22,10 @@ extern TidRangeScanState *ExecInitTidRangeScan(TidRangeScan *node,
extern void ExecEndTidRangeScan(TidRangeScanState *node);
extern void ExecReScanTidRangeScan(TidRangeScanState *node);
+/* parallel scan support */
+extern void ExecTidRangeScanEstimate(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanReInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanInitializeWorker(TidRangeScanState *node, ParallelWorkerContext *pwcxt);
+
#endif /* NODETIDRANGESCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index d927ac44a8..81eec34730 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1862,6 +1862,7 @@ typedef struct TidRangeScanState
ItemPointerData trss_mintid;
ItemPointerData trss_maxtid;
bool trss_inScan;
+ Size pscan_len; /* size of parallel tid range scan descriptor */
} TidRangeScanState;
/* ----------------
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index c5c4756b0f..d7683ec1c3 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -66,7 +66,8 @@ extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
extern TidRangePath *create_tidrangescan_path(PlannerInfo *root,
RelOptInfo *rel,
List *tidrangequals,
- Relids required_outer);
+ Relids required_outer,
+ int parallel_workers);
extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
List *subpaths, List *partial_subpaths,
List *pathkeys, Relids required_outer,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 39ba461548..c571354890 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -87,6 +87,8 @@ extern void check_index_predicates(PlannerInfo *root, RelOptInfo *rel);
* routines to generate tid paths
*/
extern void create_tidscan_paths(PlannerInfo *root, RelOptInfo *rel);
+extern Path *create_tidrangescan_subpaths(PlannerInfo *root, RelOptInfo *rel,
+ int parallel_worker);
/*
* joinpath.c
diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index 87273fa635..61e6700194 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -1293,4 +1293,55 @@ SELECT 1 FROM tenk1_vw_sec
Filter: (f1 < tenk1_vw_sec.unique1)
(9 rows)
+-- test parallel tid range scan
+EXPLAIN (COSTS OFF)
+SELECT ctid FROM tenk1 WHERE ctid > '(0,1)' LIMIT 1;
+ QUERY PLAN
+-----------------------------------------------
+ Limit
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Tid Range Scan on tenk1
+ TID Cond: (ctid > '(0,1)'::tid)
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT ctid FROM tenk1 WHERE ctid < '(400,1)' LIMIT 1;
+ QUERY PLAN
+-------------------------------------------------
+ Limit
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Tid Range Scan on tenk1
+ TID Cond: (ctid < '(400,1)'::tid)
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT ctid FROM tenk1 WHERE ctid >= '(0,1)' AND ctid <= '(400,1)' LIMIT 1;
+ QUERY PLAN
+-------------------------------------------------------------------------------
+ Limit
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Tid Range Scan on tenk1
+ TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid <= '(400,1)'::tid))
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT t.ctid,t2.c FROM tenk1 t,
+LATERAL (SELECT count(*) c FROM tenk1 t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)' LIMIT 1;
+ QUERY PLAN
+------------------------------------------------------
+ Limit
+ -> Nested Loop
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Tid Range Scan on tenk1 t
+ TID Cond: (ctid < '(1,0)'::tid)
+ -> Aggregate
+ -> Tid Range Scan on tenk1 t2
+ TID Cond: (ctid <= t.ctid)
+(9 rows)
+
rollback;
diff --git a/src/test/regress/sql/select_parallel.sql b/src/test/regress/sql/select_parallel.sql
index 20376c03fa..1d4ef68790 100644
--- a/src/test/regress/sql/select_parallel.sql
+++ b/src/test/regress/sql/select_parallel.sql
@@ -495,4 +495,19 @@ EXPLAIN (COSTS OFF)
SELECT 1 FROM tenk1_vw_sec
WHERE (SELECT sum(f1) FROM int4_tbl WHERE f1 < unique1) < 100;
+-- test parallel tid range scan
+EXPLAIN (COSTS OFF)
+SELECT ctid FROM tenk1 WHERE ctid > '(0,1)' LIMIT 1;
+
+EXPLAIN (COSTS OFF)
+SELECT ctid FROM tenk1 WHERE ctid < '(400,1)' LIMIT 1;
+
+EXPLAIN (COSTS OFF)
+SELECT ctid FROM tenk1 WHERE ctid >= '(0,1)' AND ctid <= '(400,1)' LIMIT 1;
+
+EXPLAIN (COSTS OFF)
+SELECT t.ctid,t2.c FROM tenk1 t,
+LATERAL (SELECT count(*) c FROM tenk1 t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)' LIMIT 1;
+
rollback;
--
2.17.1
On Tue, 30 Apr 2024 at 10:36, Cary Huang <cary.huang@highgo.ca> wrote:
> In one of our migration scenarios, we rely on tid range scan to migrate huge table from one database to another once the lower and upper ctid bound is determined. With the support of parallel ctid range scan, this process could be done much quicker.
I would have thought that the best way to migrate would be to further
divide the TID range into N segments and run N queries, one per
segment to get the data out.
From a CPU point of view, I'd find it hard to imagine that a SELECT * query
without any other items in the WHERE clause other than the TID range
quals would run faster with multiple workers than with 1. The problem
is that the overhead of pushing tuples to the main process often outweighs
the benefits of the parallelism. However, from an I/O point of view
on a server with slow enough disks, I can imagine there'd be a
speedup.
> The attached patch is my approach to add parallel ctid range scan to PostgreSQL's planner and executor. In my tests, I do see an increase in performance using parallel tid range scan over the single worker tid range scan and it is also faster than parallel sequential scan covering similar ranges. Of course, the table needs to be large enough to reflect the performance increase.
>
> below is the timing to complete a select query covering all the records in a simple 2-column table with 40 million records,
> - tid range scan takes 10216ms
> - tid range scan with 2 workers takes 7109ms
> - sequential scan with 2 workers takes 8499ms
Can you share more details about this test? i.e. the query, what the
times are that you've measured (EXPLAIN ANALYZE, or SELECT, COPY?).
Also, which version/commit did you patch against? I was wondering if
the read stream code added in v17 would result in the serial case
running faster because the parallelism just resulted in more I/O
concurrency.
Of course, it may be beneficial to have parallel TID Range for other
cases when more row filtering or aggregation is being done as that
requires pushing fewer tuples over from the parallel worker to the
main process. It just would be good to get to the bottom of if there's
still any advantage to parallelism when no filtering other than the
ctid quals is being done now that we've less chance of having to wait
for I/O coming from disk with the read streams code.
David
Hi David
Thank you for your reply.
> From a CPU point of view, I'd find it hard to imagine that a SELECT * query
> without any other items in the WHERE clause other than the TID range
> quals would run faster with multiple workers than with 1. The problem
> is the overhead of pushing tuples to the main process often outweighs
> the benefits of the parallelism. However, from an I/O point of view
> on a server with slow enough disks, I can imagine there'd be a
> speedup.
yeah, this is generally true. With everything set to default, the planner would not choose parallel sequential scan if the scan range covers mostly all tuples of a table (to reduce the overhead of pushing tuples to main proc as you mentioned). It is preferred when the target data is small but the table is huge. In my case, it is also the same, the planner by default uses normal tid range scan, so I had to alter cost parameters to influence the planner's decision. This is where I found that with WHERE clause only containing TID ranges that cover the entire table would result faster with parallel workers, at least in my environment.
> Of course, it may be beneficial to have parallel TID Range for other
> cases when more row filtering or aggregation is being done as that
> requires pushing fewer tuples over from the parallel worker to the
> main process. It just would be good to get to the bottom of if there's
> still any advantage to parallelism when no filtering other than the
> ctid quals is being done now that we've less chance of having to wait
> for I/O coming from disk with the read streams code.
I believe so too. I shared my test procedure below with ctid being the only quals.
>> below is the timing to complete a select query covering all the records in a simple 2-column table with 40 million records,
>> - tid range scan takes 10216ms
>> - tid range scan with 2 workers takes 7109ms
>> - sequential scan with 2 workers takes 8499ms
>
> Can you share more details about this test? i.e. the query, what the
> times are that you've measured (EXPLAIN ANALYZE, or SELECT, COPY?).
> Also, which version/commit did you patch against? I was wondering if
> the read stream code added in v17 would result in the serial case
> running faster because the parallelism just resulted in more I/O
> concurrency.
Yes, of course. These numbers were obtained earlier this year on master with the patch applied, most likely without the read stream code you mentioned. The patch attached here is rebased onto commit dd0183469bb779247c96e86c2272dca7ff4ec9e7 on master, which is quite recent and should have the read stream code for v17, as I can immediately tell that the serial scans run much faster now in my setup. I increased the records in the test table from 40 to 100 million because serial scans are much faster now. Below are the summary and details of my test. Note that I only include the EXPLAIN ANALYZE details of the round 1 test. Round 2 is the same except for different execution times.
[env]
- OS: Ubuntu 18.04
- CPU: 4 cores @ 3.40 GHz
- MEM: 16 GB
[test table setup]
initdb with all default values
CREATE TABLE test (a INT, b TEXT);
INSERT INTO test VALUES(generate_series(1,100000000), 'testing');
SELECT min(ctid), max(ctid) from test;
min | max
-------+--------------
(0,1) | (540540,100)
(1 row)
[summary]
round 1:
tid range scan: 14915ms
tid range scan 2 workers: 12265ms
seq scan with 2 workers: 12675ms
round2:
tid range scan: 12112ms
tid range scan 2 workers: 10649ms
seq scan with 2 workers: 11206ms
[details of EXPLAIN ANALYZE below]
[default tid range scan]
EXPLAIN ANALYZE SELECT a FROM test WHERE ctid >= '(1,0)' AND ctid <= '(540540,100)';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Tid Range Scan on test (cost=0.01..1227029.81 rows=68648581 width=4) (actual time=0.188..12280.791 rows=99999815 loops=1)
TID Cond: ((ctid >= '(1,0)'::tid) AND (ctid <= '(540540,100)'::tid))
Planning Time: 0.817 ms
Execution Time: 14915.035 ms
(4 rows)
[parallel tid range scan with 2 workers]
set parallel_setup_cost=0;
set parallel_tuple_cost=0;
set min_parallel_table_scan_size=0;
set max_parallel_workers_per_gather=2;
EXPLAIN ANALYZE SELECT a FROM test WHERE ctid >= '(1,0)' AND ctid <= '(540540,100)';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=0.01..511262.43 rows=68648581 width=4) (actual time=1.322..9249.197 rows=99999815 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Tid Range Scan on test (cost=0.01..511262.43 rows=28603575 width=4) (actual time=0.332..4906.262 rows=33333272 loops=3)
TID Cond: ((ctid >= '(1,0)'::tid) AND (ctid <= '(540540,100)'::tid))
Planning Time: 0.213 ms
Execution Time: 12265.873 ms
(7 rows)
[parallel seq scan with 2 workers]
set enable_tidscan = 'off';
EXPLAIN ANALYZE SELECT a FROM test WHERE ctid >= '(1,0)' AND ctid <= '(540540,100)';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Gather (cost=0.00..969595.42 rows=68648581 width=4) (actual time=4.489..9713.299 rows=99999815 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on test (cost=0.00..969595.42 rows=28603575 width=4) (actual time=0.995..5541.178 rows=33333272 loops=3)
Filter: ((ctid >= '(1,0)'::tid) AND (ctid <= '(540540,100)'::tid))
Rows Removed by Filter: 62
Planning Time: 0.129 ms
Execution Time: 12675.681 ms
(8 rows)
Best regards
Cary Huang
-------------
HighGo Software Inc. (Canada)
cary.huang@highgo.ca
www.highgo.ca
On Wed, 1 May 2024 at 07:10, Cary Huang <cary.huang@highgo.ca> wrote:
> Yes of course. These numbers were obtained earlier this year on master with the patch applied most likely without the read stream code you mentioned. The patch attached here is rebased to commit dd0183469bb779247c96e86c2272dca7ff4ec9e7 on master, which is quite recent and should have the read stream code for v17 as I can immediately tell that the serial scans run much faster now in my setup. I increased the records on the test table from 40 to 100 million because serial scans are much faster now. Below is the summary and details of my test. Note that I only include the EXPLAIN ANALYZE details of round1 test. Round2 is the same except for different execution times.
It would be good to see the EXPLAIN (ANALYZE, BUFFERS) with SET
track_io_timing = 1;
Here's a quick review
1. Does not produce correct results:
-- serial plan
postgres=# select count(*) from t where ctid >= '(0,0)' and ctid < '(10,0)';
count
-------
2260
(1 row)
-- parallel plan
postgres=# set max_parallel_workers_per_gather=2;
SET
postgres=# select count(*) from t where ctid >= '(0,0)' and ctid < '(10,0)';
count
-------
0
(1 row)
I've not really looked into why, but I see you're not calling
heap_setscanlimits() in parallel mode. You need to somehow restrict
the block range of the scan to the range specified in the quals. You
might need to do more work to make the scan limits work with parallel
scans.
If you look at heap_scan_stream_read_next_serial(), it's calling
heapgettup_advance_block(), where there's "if (--scan->rs_numblocks
== 0)". But there's no such equivalent code in
table_block_parallelscan_nextpage(), called by
heap_scan_stream_read_next_parallel(). To make Parallel TID Range
work, you'll need heap_scan_stream_read_next_parallel() to abide by
the scan limits.
2. There's a 4-line comment you've added to cost_tidrangescan() which
is just a copy and paste from cost_seqscan(). If you look at the
seqscan costing, the comment is true in that scenario, but not true
where you've pasted it. The I/O cost is all tied into run_cost.
+ /* The CPU cost is divided among all the workers. */
+ run_cost /= parallel_divisor;
+
+ /*
+ * It may be possible to amortize some of the I/O cost, but probably
+ * not very much, because most operating systems already do aggressive
+ * prefetching. For now, we assume that the disk run cost can't be
+ * amortized at all.
+ */
3. Calling TidRangeQualFromRestrictInfoList() once for the serial path
and again for the partial path isn't great. It would be good to just
call that function once and use the result for both path types.
4. create_tidrangescan_subpaths() seems like a weird name for a
function. That seems to imply that scans have subpaths. Scans are
always leaf paths and have no subpaths.
This isn't a complete review. It's just that this seems enough to keep
you busy for a while. I can look a bit harder when the patch is
working correctly. I think you should have enough feedback to allow
that now.
David
> This isn't a complete review. It's just that this seems enough to keep
> you busy for a while. I can look a bit harder when the patch is
> working correctly. I think you should have enough feedback to allow
> that now.
Thanks for the test, review and feedback. They are greatly appreciated!
I will polish the patch some more following your feedback and share new
results / patch when I have them.
Thanks again!
Cary
Hello
-- parallel plan
postgres=# set max_parallel_workers_per_gather=2;
SET
postgres=# select count(*) from t where ctid >= '(0,0)' and ctid < '(10,0)';
count
-------
0
(1 row)
I've not really looked into why, but I see you're not calling
heap_setscanlimits() in parallel mode. You need to somehow restrict
the block range of the scan to the range specified in the quals. You
might need to do more work to make the scan limits work with parallel
scans.
I found that the very first select count(*) using parallel tid range scan
returns the correct result, but subsequent runs of the same query return 0,
as you stated. This is due to "pscan->phs_syncscan" being set to true in
ExecTidRangeScanInitializeDSM(), inherited from the parallel seq scan case.
With syncscan enabled, table_block_parallelscan_nextpage() returns the
block following the end of the first tid range scan rather than the correct
start block. Since a single tid range scan does not have SO_ALLOW_SYNC set,
I figure syncscan should also be disabled in the parallel case. With this
change it is then safe to call heap_setscanlimits() in the parallel case,
so I added that call back to heap_set_tidrange() in both serial and
parallel cases.
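The syncscan interaction can be illustrated with a toy model (not PostgreSQL code; synced_start stands in for the shared position that the syncscan machinery, ss_get_location()/ss_report_location(), would remember): with sync allowed, a new scan begins wherever the previous one reported ending, not at the start block of the requested ctid range.

```c
#include <stdint.h>
#include <stdbool.h>

/* Models the shared scan position remembered by the syncscan machinery. */
static uint32_t synced_start = 0;

static uint32_t scan_start_block(bool allow_sync, uint32_t range_start)
{
    /* with sync allowed, a scan starts at the last reported position,
     * not at the start block of the requested ctid range */
    return allow_sync ? synced_start : range_start;
}

/* Run one scan of `len` pages in a table of `nblocks` pages and report
 * where it ended; returns the block the scan actually started at. */
static uint32_t run_scan(bool allow_sync, uint32_t range_start,
                         uint32_t len, uint32_t nblocks)
{
    uint32_t start = scan_start_block(allow_sync, range_start);

    synced_start = (start + len) % nblocks;  /* report end position */
    return start;
}
```

In this model the first sync-enabled scan of blocks [0,10) starts at 0, but a repeat of the same query starts at block 10 and misses the range entirely; with sync disabled every scan starts at the requested block.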
2. There's a 4 line comment you've added to cost_tidrangescan() which
is just a copy and paste from cost_seqscan(). If you look at the
seqscan costing, the comment is true in that scenario, but not true in
where you've pasted it. The I/O cost is all tied in to run_cost.
Thanks for pointing this out; I have removed the incorrect comment.
3. Calling TidRangeQualFromRestrictInfoList() once for the serial path
and again for the partial path isn't great. It would be good to just
call that function once and use the result for both path types.
Good point. I moved the addition of the tid range scan partial path into
create_tidscan_paths(), which already calls TidRangeQualFromRestrictInfoList()
for the serial path, so I can simply reuse tidrangequals when it is
appropriate to consider a parallel tid range scan.
4. create_tidrangescan_subpaths() seems like a weird name for a
function. That seems to imply that scans have subpaths. Scans are
always leaf paths and have no subpaths.
I removed this oddly named function; it is no longer needed because its
logic moved into create_tidscan_paths(), where it can reuse tidrangequals.
It would be good to see the EXPLAIN (ANALYZE, BUFFERS) with SET
track_io_timing = 1;
The attached v2 patch should address the issues above. Below are the EXPLAIN
outputs with track_io_timing = 1 in my environment. Generally, parallel tid
range scan shows higher I/O timings but shorter execution times.
SET track_io_timing = 1;
===serial tid rangescan===
EXPLAIN (ANALYZE, BUFFERS) select a from test where ctid >= '(0,0)' and ctid < '(216216,40)';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Tid Range Scan on test (cost=0.01..490815.59 rows=27459559 width=4) (actual time=0.072..10143.770 rows=39999999 loops=1)
TID Cond: ((ctid >= '(0,0)'::tid) AND (ctid < '(216216,40)'::tid))
Buffers: shared hit=298 read=215919 written=12972
I/O Timings: shared read=440.277 write=58.525
Planning:
Buffers: shared hit=2
Planning Time: 0.289 ms
Execution Time: 12497.081 ms
(8 rows)
set parallel_setup_cost=0;
set parallel_tuple_cost=0;
set min_parallel_table_scan_size=0;
set max_parallel_workers_per_gather=2;
===parallel tid rangescan===
EXPLAIN (ANALYZE, BUFFERS) select a from test where ctid >= '(0,0)' and ctid < '(216216,40)';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=0.01..256758.88 rows=40000130 width=4) (actual time=0.878..7083.705 rows=39999999 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared read=216217
I/O Timings: shared read=1224.153
-> Parallel Tid Range Scan on test (cost=0.01..256758.88 rows=16666721 width=4) (actual time=0.256..3980.770 rows=13333333 loops=3)
TID Cond: ((ctid >= '(0,0)'::tid) AND (ctid < '(216216,40)'::tid))
Buffers: shared read=216217
I/O Timings: shared read=1224.153
Planning Time: 0.258 ms
Execution Time: 9731.800 ms
(11 rows)
===serial tid rangescan with aggregate===
set max_parallel_workers_per_gather=0;
EXPLAIN (ANALYZE, BUFFERS) select count(a) from test where ctid >= '(0,0)' and ctid < '(216216,40)';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=716221.63..716221.64 rows=1 width=8) (actual time=12931.695..12931.696 rows=1 loops=1)
Buffers: shared read=216217
I/O Timings: shared read=599.331
-> Tid Range Scan on test (cost=0.01..616221.31 rows=40000130 width=4) (actual time=0.079..6800.482 rows=39999999 loops=1)
TID Cond: ((ctid >= '(0,0)'::tid) AND (ctid < '(216216,40)'::tid))
Buffers: shared read=216217
I/O Timings: shared read=599.331
Planning:
Buffers: shared hit=1 read=2
I/O Timings: shared read=0.124
Planning Time: 0.917 ms
Execution Time: 12932.348 ms
(12 rows)
===parallel tid rangescan with aggregate===
set max_parallel_workers_per_gather=2;
EXPLAIN (ANALYZE, BUFFERS) select count(a) from test where ctid >= '(0,0)' and ctid < '(216216,40)';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=298425.70..298425.71 rows=1 width=8) (actual time=4842.512..4847.863 rows=1 loops=1)
Buffers: shared read=216217
I/O Timings: shared read=1155.321
-> Gather (cost=298425.68..298425.69 rows=2 width=8) (actual time=4842.020..4847.851 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared read=216217
I/O Timings: shared read=1155.321
-> Partial Aggregate (cost=298425.68..298425.69 rows=1 width=8) (actual time=4824.730..4824.731 rows=1 loops=3)
Buffers: shared read=216217
I/O Timings: shared read=1155.321
-> Parallel Tid Range Scan on test (cost=0.01..256758.88 rows=16666721 width=4) (actual time=0.098..2614.108 rows=13333333 loops=3)
TID Cond: ((ctid >= '(0,0)'::tid) AND (ctid < '(216216,40)'::tid))
Buffers: shared read=216217
I/O Timings: shared read=1155.321
Planning:
Buffers: shared read=3
I/O Timings: shared read=3.323
Planning Time: 4.124 ms
Execution Time: 4847.992 ms
(20 rows)
Cary Huang
-------------
HighGo Software Inc. (Canada)
cary.huang@highgo.ca
www.highgo.ca
Attachments:
v2-0001-add-parallel-tid-rangescan.patch (application/octet-stream)
From f3afb1afb7bb5967d37311b2210071b668caa52f Mon Sep 17 00:00:00 2001
From: Cary Huang <cary.huang@highgo.ca>
Date: Fri, 3 May 2024 10:34:40 -0700
Subject: [PATCH] v2 parallel tid range scan: 1) fixed incorrect query output
of parallel tid range scan by disabling syncscan in such case 2) reused
tidrangequals computed from regular tid range scan instead of creating
another 3) removed unused min and max tid when initializing parallel scan
context
---
src/backend/access/table/tableam.c | 28 +++++++
src/backend/executor/execParallel.c | 20 +++++
src/backend/executor/nodeTidrangescan.c | 81 +++++++++++++++++++
src/backend/optimizer/path/costsize.c | 15 ++++
src/backend/optimizer/path/tidpath.c | 18 ++++-
src/backend/optimizer/util/pathnode.c | 7 +-
src/include/access/tableam.h | 10 +++
src/include/executor/nodeTidrangescan.h | 7 ++
src/include/nodes/execnodes.h | 1 +
src/include/optimizer/pathnode.h | 3 +-
src/test/regress/expected/select_parallel.out | 51 ++++++++++++
src/test/regress/sql/select_parallel.sql | 15 ++++
12 files changed, 251 insertions(+), 5 deletions(-)
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index e57a0b7ea3..8fd0f10c08 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -187,6 +187,34 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
pscan, flags);
}
+TableScanDesc
+table_beginscan_parallel_tidrange(Relation relation, ParallelTableScanDesc pscan)
+{
+ Snapshot snapshot;
+ uint32 flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
+ TableScanDesc sscan;
+
+ Assert(RelationGetRelid(relation) == pscan->phs_relid);
+
+ if (!pscan->phs_snapshot_any)
+ {
+ /* Snapshot was serialized -- restore it */
+ snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
+ RegisterSnapshot(snapshot);
+ flags |= SO_TEMP_SNAPSHOT;
+ }
+ else
+ {
+ /* SnapshotAny passed by caller (not serialized) */
+ snapshot = SnapshotAny;
+ }
+
+ sscan = relation->rd_tableam->scan_begin(relation, snapshot, 0, NULL,
+ pscan, flags);
+
+ return sscan;
+}
+
/* ----------------------------------------------------------------------------
* Index scan related functions.
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 8c53d1834e..e4733ca5a3 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -40,6 +40,7 @@
#include "executor/nodeSort.h"
#include "executor/nodeSubplan.h"
#include "executor/tqueue.h"
+#include "executor/nodeTidrangescan.h"
#include "jit/jit.h"
#include "nodes/nodeFuncs.h"
#include "pgstat.h"
@@ -296,6 +297,11 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecMemoizeEstimate((MemoizeState *) planstate, e->pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanEstimate((TidRangeScanState *) planstate,
+ e->pcxt);
+ break;
default:
break;
}
@@ -520,6 +526,11 @@ ExecParallelInitializeDSM(PlanState *planstate,
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecMemoizeInitializeDSM((MemoizeState *) planstate, d->pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanInitializeDSM((TidRangeScanState *) planstate,
+ d->pcxt);
+ break;
default:
break;
}
@@ -1006,6 +1017,11 @@ ExecParallelReInitializeDSM(PlanState *planstate,
case T_MemoizeState:
/* these nodes have DSM state, but no reinitialization is required */
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanReInitializeDSM((TidRangeScanState *) planstate,
+ pcxt);
+ break;
default:
break;
@@ -1372,6 +1388,10 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecMemoizeInitializeWorker((MemoizeState *) planstate, pwcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanInitializeWorker((TidRangeScanState *) planstate, pwcxt);
+ break;
default:
break;
}
diff --git a/src/backend/executor/nodeTidrangescan.c b/src/backend/executor/nodeTidrangescan.c
index 9aa7683d7e..b1553b990d 100644
--- a/src/backend/executor/nodeTidrangescan.c
+++ b/src/backend/executor/nodeTidrangescan.c
@@ -403,3 +403,84 @@ ExecInitTidRangeScan(TidRangeScan *node, EState *estate, int eflags)
*/
return tidrangestate;
}
+/* ----------------------------------------------------------------
+ * Parallel Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanEstimate
+ *
+ * Compute the amount of space we'll need in the parallel
+ * query DSM, and inform pcxt->estimator about our needs.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanEstimate(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+
+ node->pscan_len = table_parallelscan_estimate(node->ss.ss_currentRelation,
+ estate->es_snapshot);
+ shm_toc_estimate_chunk(&pcxt->estimator, node->pscan_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanInitializeDSM
+ *
+ * Set up a parallel heap scan descriptor.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeDSM(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+ ParallelTableScanDesc pscan;
+
+ pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
+ table_parallelscan_initialize(node->ss.ss_currentRelation,
+ pscan,
+ estate->es_snapshot);
+ /* disable syncscan in parallel tid range scan */
+ pscan->phs_syncscan = false;
+ shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan);
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanReInitializeDSM
+ *
+ * Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanReInitializeDSM(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ ParallelTableScanDesc pscan;
+
+ pscan = node->ss.ss_currentScanDesc->rs_parallel;
+ table_parallelscan_reinitialize(node->ss.ss_currentRelation, pscan);
+
+}
+
+/* ----------------------------------------------------------------
+ * ExecSeqScanInitializeWorker
+ *
+ * Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeWorker(TidRangeScanState *node,
+ ParallelWorkerContext *pwcxt)
+{
+ ParallelTableScanDesc pscan;
+
+ pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan);
+}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index ee23ed7835..fee603d048 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1435,6 +1435,21 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
startup_cost += path->pathtarget->cost.startup;
run_cost += path->pathtarget->cost.per_tuple * path->rows;
+ /* Adjust costing for parallelism, if used. */
+ if (path->parallel_workers > 0)
+ {
+ double parallel_divisor = get_parallel_divisor(path);
+
+ /* The CPU cost is divided among all the workers. */
+ run_cost /= parallel_divisor;
+
+ /*
+ * In the case of a parallel plan, the row count needs to represent
+ * the number of tuples processed per worker.
+ */
+ path->rows = clamp_row_est(path->rows / parallel_divisor);
+ }
+
path->startup_cost = startup_cost;
path->total_cost = startup_cost + run_cost;
}
diff --git a/src/backend/optimizer/path/tidpath.c b/src/backend/optimizer/path/tidpath.c
index 2ae5ddfe43..3c52ef911e 100644
--- a/src/backend/optimizer/path/tidpath.c
+++ b/src/backend/optimizer/path/tidpath.c
@@ -46,6 +46,7 @@
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/restrictinfo.h"
+#include "optimizer/cost.h"
/*
@@ -496,7 +497,22 @@ create_tidscan_paths(PlannerInfo *root, RelOptInfo *rel)
add_path(rel, (Path *) create_tidrangescan_path(root, rel,
tidrangequals,
- required_outer));
+ required_outer,
+ 0));
+
+ /* If appropriate, consider parallel tid range scan */
+ if (rel->consider_parallel && required_outer == NULL)
+ {
+ int parallel_workers;
+
+ parallel_workers = compute_parallel_worker(rel, rel->pages, -1,
+ max_parallel_workers_per_gather);
+ if (parallel_workers > 0)
+ {
+ add_partial_path(rel, (Path *) create_tidrangescan_path(root, rel, tidrangequals,
+ required_outer, parallel_workers));
+ }
+ }
}
/*
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 3cf1dac087..7ceeaf8688 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1206,7 +1206,8 @@ create_tidscan_path(PlannerInfo *root, RelOptInfo *rel, List *tidquals,
*/
TidRangePath *
create_tidrangescan_path(PlannerInfo *root, RelOptInfo *rel,
- List *tidrangequals, Relids required_outer)
+ List *tidrangequals, Relids required_outer,
+ int parallel_workers)
{
TidRangePath *pathnode = makeNode(TidRangePath);
@@ -1215,9 +1216,9 @@ create_tidrangescan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->path.pathtarget = rel->reltarget;
pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
required_outer);
- pathnode->path.parallel_aware = false;
+ pathnode->path.parallel_aware = (parallel_workers > 0);
pathnode->path.parallel_safe = rel->consider_parallel;
- pathnode->path.parallel_workers = 0;
+ pathnode->path.parallel_workers = parallel_workers;
pathnode->path.pathkeys = NIL; /* always unordered */
pathnode->tidrangequals = tidrangequals;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8e583b45cd..2cffd813a5 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1175,6 +1175,16 @@ extern void table_parallelscan_initialize(Relation rel,
extern TableScanDesc table_beginscan_parallel(Relation relation,
ParallelTableScanDesc pscan);
+/*
+ * Begin a parallel tidrange scan. `pscan` needs to have been initialized with
+ * table_parallelscan_initialize(), for the same relation. The initialization
+ * does not need to have happened in this backend.
+ *
+ * Caller must hold a suitable lock on the relation.
+ */
+extern TableScanDesc table_beginscan_parallel_tidrange(Relation relation,
+ ParallelTableScanDesc pscan);
+
/*
* Restart a parallel scan. Call this in the leader process. Caller is
* responsible for making sure that all workers have finished the scan
diff --git a/src/include/executor/nodeTidrangescan.h b/src/include/executor/nodeTidrangescan.h
index 1cfc7a07be..977cb8eb6e 100644
--- a/src/include/executor/nodeTidrangescan.h
+++ b/src/include/executor/nodeTidrangescan.h
@@ -14,6 +14,7 @@
#ifndef NODETIDRANGESCAN_H
#define NODETIDRANGESCAN_H
+#include "access/parallel.h"
#include "nodes/execnodes.h"
extern TidRangeScanState *ExecInitTidRangeScan(TidRangeScan *node,
@@ -21,4 +22,10 @@ extern TidRangeScanState *ExecInitTidRangeScan(TidRangeScan *node,
extern void ExecEndTidRangeScan(TidRangeScanState *node);
extern void ExecReScanTidRangeScan(TidRangeScanState *node);
+/* parallel scan support */
+extern void ExecTidRangeScanEstimate(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanReInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanInitializeWorker(TidRangeScanState *node, ParallelWorkerContext *pwcxt);
+
#endif /* NODETIDRANGESCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index d927ac44a8..81eec34730 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1862,6 +1862,7 @@ typedef struct TidRangeScanState
ItemPointerData trss_mintid;
ItemPointerData trss_maxtid;
bool trss_inScan;
+ Size pscan_len; /* size of parallel tid range scan descriptor */
} TidRangeScanState;
/* ----------------
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index c5c4756b0f..d7683ec1c3 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -66,7 +66,8 @@ extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
extern TidRangePath *create_tidrangescan_path(PlannerInfo *root,
RelOptInfo *rel,
List *tidrangequals,
- Relids required_outer);
+ Relids required_outer,
+ int parallel_workers);
extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
List *subpaths, List *partial_subpaths,
List *pathkeys, Relids required_outer,
diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index 87273fa635..61e6700194 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -1293,4 +1293,55 @@ SELECT 1 FROM tenk1_vw_sec
Filter: (f1 < tenk1_vw_sec.unique1)
(9 rows)
+-- test parallel tid range scan
+EXPLAIN (COSTS OFF)
+SELECT ctid FROM tenk1 WHERE ctid > '(0,1)' LIMIT 1;
+ QUERY PLAN
+-----------------------------------------------
+ Limit
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Tid Range Scan on tenk1
+ TID Cond: (ctid > '(0,1)'::tid)
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT ctid FROM tenk1 WHERE ctid < '(400,1)' LIMIT 1;
+ QUERY PLAN
+-------------------------------------------------
+ Limit
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Tid Range Scan on tenk1
+ TID Cond: (ctid < '(400,1)'::tid)
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT ctid FROM tenk1 WHERE ctid >= '(0,1)' AND ctid <= '(400,1)' LIMIT 1;
+ QUERY PLAN
+-------------------------------------------------------------------------------
+ Limit
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Tid Range Scan on tenk1
+ TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid <= '(400,1)'::tid))
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT t.ctid,t2.c FROM tenk1 t,
+LATERAL (SELECT count(*) c FROM tenk1 t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)' LIMIT 1;
+ QUERY PLAN
+------------------------------------------------------
+ Limit
+ -> Nested Loop
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Tid Range Scan on tenk1 t
+ TID Cond: (ctid < '(1,0)'::tid)
+ -> Aggregate
+ -> Tid Range Scan on tenk1 t2
+ TID Cond: (ctid <= t.ctid)
+(9 rows)
+
rollback;
diff --git a/src/test/regress/sql/select_parallel.sql b/src/test/regress/sql/select_parallel.sql
index 20376c03fa..1d4ef68790 100644
--- a/src/test/regress/sql/select_parallel.sql
+++ b/src/test/regress/sql/select_parallel.sql
@@ -495,4 +495,19 @@ EXPLAIN (COSTS OFF)
SELECT 1 FROM tenk1_vw_sec
WHERE (SELECT sum(f1) FROM int4_tbl WHERE f1 < unique1) < 100;
+-- test parallel tid range scan
+EXPLAIN (COSTS OFF)
+SELECT ctid FROM tenk1 WHERE ctid > '(0,1)' LIMIT 1;
+
+EXPLAIN (COSTS OFF)
+SELECT ctid FROM tenk1 WHERE ctid < '(400,1)' LIMIT 1;
+
+EXPLAIN (COSTS OFF)
+SELECT ctid FROM tenk1 WHERE ctid >= '(0,1)' AND ctid <= '(400,1)' LIMIT 1;
+
+EXPLAIN (COSTS OFF)
+SELECT t.ctid,t2.c FROM tenk1 t,
+LATERAL (SELECT count(*) c FROM tenk1 t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)' LIMIT 1;
+
rollback;
--
2.17.1
On Sat, 4 May 2024 at 06:55, Cary Huang <cary.huang@highgo.ca> wrote:
With syncscan enabled, the "table_block_parallelscan_nextpage()" would
return the next block since the end of the first tid rangescan instead of the
correct start block that should be scanned. I see that single tid rangescan
does not have SO_ALLOW_SYNC set, so I figure syncscan should also be
disabled in parallel case. With this change, then it would be okay to call
heap_setscanlimits() in parallel case, so I added this call back to
heap_set_tidrange() in both serial and parallel cases.
This now calls heap_setscanlimits() for the parallel version, it's
just that table_block_parallelscan_nextpage() does nothing to obey
those limits.
The only reason the code isn't reading the entire table is due to the
optimisation in heap_getnextslot_tidrange() which returns false when
the ctid goes out of range. i.e, this code:
/*
* When scanning forward, the TIDs will be in ascending order.
* Future tuples in this direction will be higher still, so we can
* just return false to indicate there will be no more tuples.
*/
if (ScanDirectionIsForward(direction))
return false;
If I comment out that line, I see all pages are accessed:
postgres=# explain (analyze, buffers) select count(*) from a where
ctid >= '(0,1)' and ctid < '(11,0)';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=18.80..18.81 rows=1 width=8) (actual
time=33.530..36.118 rows=1 loops=1)
Buffers: shared read=4425
-> Gather (cost=18.78..18.79 rows=2 width=8) (actual
time=33.456..36.102 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared read=4425
-> Partial Aggregate (cost=18.78..18.79 rows=1 width=8)
(actual time=20.389..20.390 rows=1 loops=3)
Buffers: shared read=4425
-> Parallel Tid Range Scan on a (cost=0.01..16.19
rows=1035 width=0) (actual time=9.375..20.349 rows=829 loops=3)
TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid <
'(11,0)'::tid))
Buffers: shared read=4425 <---- this is all
pages in the table instead of 11 pages.
With that code still commented out, the non-parallel version still
won't read all pages due to the setscanlimits being obeyed.
postgres=# set max_parallel_workers_per_gather=0;
SET
postgres=# explain (analyze, buffers) select count(*) from a where
ctid >= '(0,1)' and ctid < '(11,0)';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Aggregate (cost=45.07..45.08 rows=1 width=8) (actual
time=0.302..0.302 rows=1 loops=1)
Buffers: shared hit=11
-> Tid Range Scan on a (cost=0.01..38.86 rows=2485 width=0)
(actual time=0.019..0.188 rows=2486 loops=1)
TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid < '(11,0)'::tid))
Buffers: shared hit=11
If I put that code back in, how many pages are read depends on the
number of parallel workers as workers will keep running with higher
page numbers and heap_getnextslot_tidrange() will just (inefficiently)
filter those out.
max_parallel_workers_per_gather=2;
-> Parallel Tid Range Scan on a (cost=0.01..16.19
rows=1035 width=0) (actual time=0.191..0.310 rows=829 loops=3)
TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid <
'(11,0)'::tid))
Buffers: shared read=17
max_parallel_workers_per_gather=3;
-> Parallel Tid Range Scan on a (cost=0.01..12.54
rows=802 width=0) (actual time=0.012..0.114 rows=622 loops=4)
TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid <
'(11,0)'::tid))
Buffers: shared hit=19
max_parallel_workers_per_gather=4;
-> Parallel Tid Range Scan on a (cost=0.01..9.72
rows=621 width=0) (actual time=0.014..0.135 rows=497 loops=5)
TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid <
'(11,0)'::tid))
Buffers: shared hit=21
To fix this you need to make table_block_parallelscan_nextpage obey
the limits imposed by heap_setscanlimits().
The equivalent code in the non-parallel version is in
heapgettup_advance_block().
/* check if the limit imposed by heap_setscanlimits() is met */
if (scan->rs_numblocks != InvalidBlockNumber)
{
if (--scan->rs_numblocks == 0)
return InvalidBlockNumber;
}
I've not studied exactly how you'd get the rs_numblocks information
down to the parallel scan descriptor. But when you figure that out,
just remember that you can't do the --scan->rs_numblocks from
table_block_parallelscan_nextpage() as that's not parallel safe. You
might be able to add an or condition to: "if (nallocated >=
pbscan->phs_nblocks)" to make it "if (nallocated >=
pbscan->phs_nblocks || nallocated >= pbscan->phs_numblocks)",
although the field names don't seem very intuitive there. It would be
nicer if the HeapScanDesc field was called rs_blocklimit rather than
rs_numblocks. It's not for this patch to go messing with that,
however.
David
Thank you very much for the test and review. Greatly appreciated!
This now calls heap_setscanlimits() for the parallel version, it's
just that table_block_parallelscan_nextpage() does nothing to obey
those limits.
Yes, you are absolutely right. Although heap_setscanlimits() is now called
by parallel tid range scan, table_block_parallelscan_nextpage() does nothing
to obey those limits, resulting in extra blocks being inefficiently filtered
out by the optimization code you mentioned in heap_getnextslot_tidrange().
I've not studied exactly how you'd get the rs_numblocks information
down to the parallel scan descriptor. But when you figure that out,
just remember that you can't do the --scan->rs_numblocks from
table_block_parallelscan_nextpage() as that's not parallel safe. You
might be able to add an or condition to: "if (nallocated >=
pbscan->phs_nblocks)" to make it "if (nallocated >=
pbscan->phs_nblocks || nallocated >= pbscan->phs_numblocks)",
although the field names don't seem very intuitive there. It would be
nicer if the HeapScanDesc field was called rs_blocklimit rather than
rs_numblocks. It's not for this patch to go messing with that,
however.
rs_numblocks was not passed down to the parallel scan context, and
table_block_parallelscan_nextpage() had no logic to enforce the block range
set by heap_setscanlimits() in a parallel scan. I also noticed that
rs_startblock was not passed to the parallel scan context either, which
caused the parallel scan to always start from block 0 even when a lower
ctid bound was specified.
So I added logic in heap_set_tidrange() to pass these two values to the
parallel scan descriptor as "phs_startblock" and "phs_numblock", making
them available to table_block_parallelscan_nextpage() during a parallel scan.
I followed your recommendation and modified the condition to:
if (nallocated >= pbscan->phs_nblocks || (pbscan->phs_numblock != 0 &&
nallocated >= pbscan->phs_numblock))
so that the parallel tid range scan will stop when the upper scan limit is
reached. With these changes, I see that the number of buffer reads is
consistent between single and parallel ctid range scans. The v3 patch is
attached.
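The stopping condition can be sketched like this (a simplified standalone model using the field names from the description above, not the patch itself):

```c
#include <stdint.h>
#include <stdatomic.h>

#define INVALID_BLOCK UINT32_MAX

/* Simplified parallel allocator that also honors an upper page limit. */
typedef struct ParallelScanSketch {
    atomic_uint nallocated;   /* pages handed out so far (shared) */
    uint32_t    phs_nblocks;  /* pages in the relation */
    uint32_t    phs_numblock; /* pages the tid range allows; 0 = no limit */
} ParallelScanSketch;

static uint32_t limited_next_block(ParallelScanSketch *p)
{
    /* fetch-add keeps this parallel safe; no "--rs_numblocks" style
     * decrement is needed, only a comparison against the limit */
    uint32_t n = atomic_fetch_add(&p->nallocated, 1);

    if (n >= p->phs_nblocks ||
        (p->phs_numblock != 0 && n >= p->phs_numblock))
        return INVALID_BLOCK;
    return n;
}

/* Count pages visited for a relation of `nblocks` pages and a limit. */
static uint32_t pages_visited(uint32_t nblocks, uint32_t numblock)
{
    ParallelScanSketch p;
    uint32_t count = 0;

    atomic_init(&p.nallocated, 0);
    p.phs_nblocks = nblocks;
    p.phs_numblock = numblock;
    while (limited_next_block(&p) != INVALID_BLOCK)
        count++;
    return count;
}
```

With the limit in place the allocator hands out exactly the limited number of pages, which matches the consistent buffer counts shown in the EXPLAIN outputs below.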
postgres=# explain (analyze, buffers) select count(*) from test where ctid >= '(0,1)' and ctid < '(11,0)';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Aggregate (cost=39.43..39.44 rows=1 width=8) (actual time=1.007..1.008 rows=1 loops=1)
Buffers: shared read=11
-> Tid Range Scan on test (cost=0.01..34.35 rows=2034 width=0) (actual time=0.076..0.639 rows=2035 loops=1)
TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid < '(11,0)'::tid))
Buffers: shared read=11
postgres=# set max_parallel_workers_per_gather=2;
SET
postgres=# explain (analyze, buffers) select count(*) from test where ctid >= '(0,1)' and ctid < '(11,0)';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=16.45..16.46 rows=1 width=8) (actual time=14.329..16.840 rows=1 loops=1)
Buffers: shared hit=11
-> Gather (cost=16.43..16.44 rows=2 width=8) (actual time=3.197..16.814 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=11
-> Partial Aggregate (cost=16.43..16.44 rows=1 width=8) (actual time=0.705..0.706 rows=1 loops=3)
Buffers: shared hit=11
-> Parallel Tid Range Scan on test (cost=0.01..14.31 rows=848 width=0) (actual time=0.022..0.423 rows=678 loops=3)
TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid < '(11,0)'::tid))
Buffers: shared hit=11
postgres=# set max_parallel_workers_per_gather=3;
SET
postgres=# explain (analyze, buffers) select count(*) from test where ctid >= '(0,1)' and ctid < '(11,0)';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=12.74..12.75 rows=1 width=8) (actual time=16.793..19.053 rows=1 loops=1)
Buffers: shared hit=11
-> Gather (cost=12.72..12.73 rows=3 width=8) (actual time=2.827..19.012 rows=4 loops=1)
Workers Planned: 3
Workers Launched: 3
Buffers: shared hit=11
-> Partial Aggregate (cost=12.72..12.73 rows=1 width=8) (actual time=0.563..0.565 rows=1 loops=4)
Buffers: shared hit=11
-> Parallel Tid Range Scan on test (cost=0.01..11.08 rows=656 width=0) (actual time=0.018..0.338 rows=509 loops=4)
TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid < '(11,0)'::tid))
Buffers: shared hit=11
thank you!
Cary Huang
-------------
HighGo Software Inc. (Canada)
cary.huang@highgo.ca
www.highgo.ca
Attachments:
v3-0001-add-parallel-tid-rangescan.patch (application/octet-stream)
From dd3aaafc4ca9c294e78424ef9341ee1dd66d0ff7 Mon Sep 17 00:00:00 2001
From: Cary Huang <cary.huang@highgo.ca>
Date: Wed, 8 May 2024 13:41:37 -0700
Subject: [PATCH] v3 parallel tid range scan: 1) corrected the startblock and
numblock values when parallel tid range scan is used
---
src/backend/access/heap/heapam.c | 14 ++++
src/backend/access/table/tableam.c | 31 ++++++-
src/backend/executor/execParallel.c | 20 +++++
src/backend/executor/nodeTidrangescan.c | 81 +++++++++++++++++++
src/backend/optimizer/path/costsize.c | 15 ++++
src/backend/optimizer/path/tidpath.c | 18 ++++-
src/backend/optimizer/util/pathnode.c | 7 +-
src/include/access/relscan.h | 1 +
src/include/access/tableam.h | 10 +++
src/include/executor/nodeTidrangescan.h | 7 ++
src/include/nodes/execnodes.h | 1 +
src/include/optimizer/pathnode.h | 3 +-
src/test/regress/expected/select_parallel.out | 51 ++++++++++++
src/test/regress/sql/select_parallel.sql | 15 ++++
14 files changed, 268 insertions(+), 6 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 4be0dee4de..a696cba458 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -307,6 +307,7 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
* results for a non-MVCC snapshot, the caller must hold some higher-level
* lock that ensures the interesting tuple(s) won't change.)
*/
+
if (scan->rs_base.rs_parallel != NULL)
{
bpscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
@@ -1391,6 +1392,19 @@ heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
/* Set the start block and number of blocks to scan */
heap_setscanlimits(sscan, startBlk, numBlks);
+ /*
+ * if parallel mode is used, store startblock and numblocks in parallel
+ * scan descriptor as well
+ */
+ if (scan->rs_base.rs_parallel != NULL)
+ {
+ ParallelBlockTableScanDesc bpscan = NULL;
+
+ bpscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
+ bpscan->phs_startblock = scan->rs_startblock;
+ bpscan->phs_numblock = scan->rs_numblocks;
+ }
+
/* Finally, set the TID range in sscan */
ItemPointerCopy(&lowestItem, &sscan->rs_mintid);
ItemPointerCopy(&highestItem, &sscan->rs_maxtid);
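For reference, the start block and block count that heap_set_tidrange() derives from the ctid bounds (and that this hunk now copies into the parallel scan descriptor) follow simple block arithmetic. A minimal sketch of that arithmetic, with invented helper names and assuming both endpoint blocks fall inside the relation:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/* The scan starts at the block containing the lower-bound ctid. */
static BlockNumber
tidrange_start_block(BlockNumber min_block)
{
    return min_block;
}

/*
 * The scan covers every block from the lower bound's block up to and
 * including the upper bound's block, hence the +1.
 */
static BlockNumber
tidrange_num_blocks(BlockNumber min_block, BlockNumber max_block)
{
    return max_block - min_block + 1;
}
```

With a range like ctid >= '(0,1)' AND ctid < '(11,0)', this yields startblock 0 and 11 blocks to scan, matching the 11 shared buffer hits in the EXPLAIN output above.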
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index e57a0b7ea3..61959d2b7d 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -187,6 +187,34 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
pscan, flags);
}
+TableScanDesc
+table_beginscan_parallel_tidrange(Relation relation, ParallelTableScanDesc pscan)
+{
+ Snapshot snapshot;
+ uint32 flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
+ TableScanDesc sscan;
+
+ Assert(RelationGetRelid(relation) == pscan->phs_relid);
+
+ if (!pscan->phs_snapshot_any)
+ {
+ /* Snapshot was serialized -- restore it */
+ snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
+ RegisterSnapshot(snapshot);
+ flags |= SO_TEMP_SNAPSHOT;
+ }
+ else
+ {
+ /* SnapshotAny passed by caller (not serialized) */
+ snapshot = SnapshotAny;
+ }
+
+ sscan = relation->rd_tableam->scan_begin(relation, snapshot, 0, NULL,
+ pscan, flags);
+
+ return sscan;
+}
+
/* ----------------------------------------------------------------------------
* Index scan related functions.
@@ -576,7 +604,8 @@ table_block_parallelscan_nextpage(Relation rel,
pbscanwork->phsw_chunk_remaining = pbscanwork->phsw_chunk_size - 1;
}
- if (nallocated >= pbscan->phs_nblocks)
+ if (nallocated >= pbscan->phs_nblocks || (pbscan->phs_numblock != 0 &&
+ nallocated >= pbscan->phs_numblock))
page = InvalidBlockNumber; /* all blocks have been allocated */
else
page = (nallocated + pbscan->phs_startblock) % pbscan->phs_nblocks;
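The hunk above caps parallel block allocation at phs_numblock. As a rough illustration of that shared-counter logic (not the patch's code: SharedScan and next_block are invented names, and the real table_block_parallelscan_nextpage() also does chunked allocation with ramp-down), each worker atomically claims the next block offset and stops once the range is exhausted:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define InvalidBlockNumber ((uint32_t) 0xFFFFFFFF)

/* Hypothetical, simplified model of ParallelBlockTableScanDescData. */
typedef struct
{
    uint32_t phs_nblocks;        /* total blocks in relation */
    uint32_t phs_startblock;     /* first block of the range */
    uint32_t phs_numblock;       /* cap from the TID range; 0 = no cap */
    atomic_uint_fast64_t phs_nallocated; /* blocks handed out so far */
} SharedScan;

/* Each worker calls this to claim its next block. */
static uint32_t
next_block(SharedScan *scan)
{
    uint64_t nallocated = atomic_fetch_add(&scan->phs_nallocated, 1);

    /* Stop at the end of the relation, or at the TID-range cap. */
    if (nallocated >= scan->phs_nblocks ||
        (scan->phs_numblock != 0 && nallocated >= scan->phs_numblock))
        return InvalidBlockNumber;

    return (uint32_t) ((nallocated + scan->phs_startblock) % scan->phs_nblocks);
}
```

The key point of the change is the second test: without it, workers would keep allocating pages past the upper ctid bound all the way to phs_nblocks.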
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 8c53d1834e..e4733ca5a3 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -40,6 +40,7 @@
#include "executor/nodeSort.h"
#include "executor/nodeSubplan.h"
#include "executor/tqueue.h"
+#include "executor/nodeTidrangescan.h"
#include "jit/jit.h"
#include "nodes/nodeFuncs.h"
#include "pgstat.h"
@@ -296,6 +297,11 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecMemoizeEstimate((MemoizeState *) planstate, e->pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanEstimate((TidRangeScanState *) planstate,
+ e->pcxt);
+ break;
default:
break;
}
@@ -520,6 +526,11 @@ ExecParallelInitializeDSM(PlanState *planstate,
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecMemoizeInitializeDSM((MemoizeState *) planstate, d->pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanInitializeDSM((TidRangeScanState *) planstate,
+ d->pcxt);
+ break;
default:
break;
}
@@ -1006,6 +1017,11 @@ ExecParallelReInitializeDSM(PlanState *planstate,
case T_MemoizeState:
/* these nodes have DSM state, but no reinitialization is required */
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanReInitializeDSM((TidRangeScanState *) planstate,
+ pcxt);
+ break;
default:
break;
@@ -1372,6 +1388,10 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecMemoizeInitializeWorker((MemoizeState *) planstate, pwcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanInitializeWorker((TidRangeScanState *) planstate, pwcxt);
+ break;
default:
break;
}
diff --git a/src/backend/executor/nodeTidrangescan.c b/src/backend/executor/nodeTidrangescan.c
index 9aa7683d7e..b1553b990d 100644
--- a/src/backend/executor/nodeTidrangescan.c
+++ b/src/backend/executor/nodeTidrangescan.c
@@ -403,3 +403,84 @@ ExecInitTidRangeScan(TidRangeScan *node, EState *estate, int eflags)
*/
return tidrangestate;
}
+/* ----------------------------------------------------------------
+ * Parallel Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanEstimate
+ *
+ * Compute the amount of space we'll need in the parallel
+ * query DSM, and inform pcxt->estimator about our needs.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanEstimate(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+
+ node->pscan_len = table_parallelscan_estimate(node->ss.ss_currentRelation,
+ estate->es_snapshot);
+ shm_toc_estimate_chunk(&pcxt->estimator, node->pscan_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanInitializeDSM
+ *
+ * Set up a parallel heap scan descriptor.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeDSM(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+ ParallelTableScanDesc pscan;
+
+ pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
+ table_parallelscan_initialize(node->ss.ss_currentRelation,
+ pscan,
+ estate->es_snapshot);
+ /* disable syncscan in parallel tid range scan */
+ pscan->phs_syncscan = false;
+ shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan);
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanReInitializeDSM
+ *
+ * Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanReInitializeDSM(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ ParallelTableScanDesc pscan;
+
+ pscan = node->ss.ss_currentScanDesc->rs_parallel;
+ table_parallelscan_reinitialize(node->ss.ss_currentRelation, pscan);
+
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanInitializeWorker
+ *
+ * Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeWorker(TidRangeScanState *node,
+ ParallelWorkerContext *pwcxt)
+{
+ ParallelTableScanDesc pscan;
+
+ pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan);
+}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index ee23ed7835..fee603d048 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1435,6 +1435,21 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
startup_cost += path->pathtarget->cost.startup;
run_cost += path->pathtarget->cost.per_tuple * path->rows;
+ /* Adjust costing for parallelism, if used. */
+ if (path->parallel_workers > 0)
+ {
+ double parallel_divisor = get_parallel_divisor(path);
+
+ /* The CPU cost is divided among all the workers. */
+ run_cost /= parallel_divisor;
+
+ /*
+ * In the case of a parallel plan, the row count needs to represent
+ * the number of tuples processed per worker.
+ */
+ path->rows = clamp_row_est(path->rows / parallel_divisor);
+ }
+
path->startup_cost = startup_cost;
path->total_cost = startup_cost + run_cost;
}
diff --git a/src/backend/optimizer/path/tidpath.c b/src/backend/optimizer/path/tidpath.c
index 2ae5ddfe43..3c52ef911e 100644
--- a/src/backend/optimizer/path/tidpath.c
+++ b/src/backend/optimizer/path/tidpath.c
@@ -46,6 +46,7 @@
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/restrictinfo.h"
+#include "optimizer/cost.h"
/*
@@ -496,7 +497,22 @@ create_tidscan_paths(PlannerInfo *root, RelOptInfo *rel)
add_path(rel, (Path *) create_tidrangescan_path(root, rel,
tidrangequals,
- required_outer));
+ required_outer,
+ 0));
+
+ /* If appropriate, consider parallel tid range scan */
+ if (rel->consider_parallel && required_outer == NULL)
+ {
+ int parallel_workers;
+
+ parallel_workers = compute_parallel_worker(rel, rel->pages, -1,
+ max_parallel_workers_per_gather);
+ if (parallel_workers > 0)
+ {
+ add_partial_path(rel, (Path *) create_tidrangescan_path(root, rel, tidrangequals,
+ required_outer, parallel_workers));
+ }
+ }
}
/*
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 3cf1dac087..7ceeaf8688 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1206,7 +1206,8 @@ create_tidscan_path(PlannerInfo *root, RelOptInfo *rel, List *tidquals,
*/
TidRangePath *
create_tidrangescan_path(PlannerInfo *root, RelOptInfo *rel,
- List *tidrangequals, Relids required_outer)
+ List *tidrangequals, Relids required_outer,
+ int parallel_workers)
{
TidRangePath *pathnode = makeNode(TidRangePath);
@@ -1215,9 +1216,9 @@ create_tidrangescan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->path.pathtarget = rel->reltarget;
pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
required_outer);
- pathnode->path.parallel_aware = false;
+ pathnode->path.parallel_aware = (parallel_workers > 0);
pathnode->path.parallel_safe = rel->consider_parallel;
- pathnode->path.parallel_workers = 0;
+ pathnode->path.parallel_workers = parallel_workers;
pathnode->path.pathkeys = NIL; /* always unordered */
pathnode->tidrangequals = tidrangequals;
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 521043304a..9bede2d6e6 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -81,6 +81,7 @@ typedef struct ParallelBlockTableScanDescData
BlockNumber phs_startblock; /* starting block number */
pg_atomic_uint64 phs_nallocated; /* number of blocks allocated to
* workers so far. */
+ BlockNumber phs_numblock; /* max number of blocks to scan */
} ParallelBlockTableScanDescData;
typedef struct ParallelBlockTableScanDescData *ParallelBlockTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8e583b45cd..2cffd813a5 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1175,6 +1175,16 @@ extern void table_parallelscan_initialize(Relation rel,
extern TableScanDesc table_beginscan_parallel(Relation relation,
ParallelTableScanDesc pscan);
+/*
+ * Begin a parallel tidrange scan. `pscan` needs to have been initialized with
+ * table_parallelscan_initialize(), for the same relation. The initialization
+ * does not need to have happened in this backend.
+ *
+ * Caller must hold a suitable lock on the relation.
+ */
+extern TableScanDesc table_beginscan_parallel_tidrange(Relation relation,
+ ParallelTableScanDesc pscan);
+
/*
* Restart a parallel scan. Call this in the leader process. Caller is
* responsible for making sure that all workers have finished the scan
diff --git a/src/include/executor/nodeTidrangescan.h b/src/include/executor/nodeTidrangescan.h
index 1cfc7a07be..977cb8eb6e 100644
--- a/src/include/executor/nodeTidrangescan.h
+++ b/src/include/executor/nodeTidrangescan.h
@@ -14,6 +14,7 @@
#ifndef NODETIDRANGESCAN_H
#define NODETIDRANGESCAN_H
+#include "access/parallel.h"
#include "nodes/execnodes.h"
extern TidRangeScanState *ExecInitTidRangeScan(TidRangeScan *node,
@@ -21,4 +22,10 @@ extern TidRangeScanState *ExecInitTidRangeScan(TidRangeScan *node,
extern void ExecEndTidRangeScan(TidRangeScanState *node);
extern void ExecReScanTidRangeScan(TidRangeScanState *node);
+/* parallel scan support */
+extern void ExecTidRangeScanEstimate(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanReInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanInitializeWorker(TidRangeScanState *node, ParallelWorkerContext *pwcxt);
+
#endif /* NODETIDRANGESCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index d927ac44a8..81eec34730 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1862,6 +1862,7 @@ typedef struct TidRangeScanState
ItemPointerData trss_mintid;
ItemPointerData trss_maxtid;
bool trss_inScan;
+ Size pscan_len; /* size of parallel tid range scan descriptor */
} TidRangeScanState;
/* ----------------
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index c5c4756b0f..d7683ec1c3 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -66,7 +66,8 @@ extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
extern TidRangePath *create_tidrangescan_path(PlannerInfo *root,
RelOptInfo *rel,
List *tidrangequals,
- Relids required_outer);
+ Relids required_outer,
+ int parallel_workers);
extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
List *subpaths, List *partial_subpaths,
List *pathkeys, Relids required_outer,
diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index 87273fa635..61e6700194 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -1293,4 +1293,55 @@ SELECT 1 FROM tenk1_vw_sec
Filter: (f1 < tenk1_vw_sec.unique1)
(9 rows)
+-- test parallel tid range scan
+EXPLAIN (COSTS OFF)
+SELECT ctid FROM tenk1 WHERE ctid > '(0,1)' LIMIT 1;
+ QUERY PLAN
+-----------------------------------------------
+ Limit
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Tid Range Scan on tenk1
+ TID Cond: (ctid > '(0,1)'::tid)
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT ctid FROM tenk1 WHERE ctid < '(400,1)' LIMIT 1;
+ QUERY PLAN
+-------------------------------------------------
+ Limit
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Tid Range Scan on tenk1
+ TID Cond: (ctid < '(400,1)'::tid)
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT ctid FROM tenk1 WHERE ctid >= '(0,1)' AND ctid <= '(400,1)' LIMIT 1;
+ QUERY PLAN
+-------------------------------------------------------------------------------
+ Limit
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Tid Range Scan on tenk1
+ TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid <= '(400,1)'::tid))
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT t.ctid,t2.c FROM tenk1 t,
+LATERAL (SELECT count(*) c FROM tenk1 t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)' LIMIT 1;
+ QUERY PLAN
+------------------------------------------------------
+ Limit
+ -> Nested Loop
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Tid Range Scan on tenk1 t
+ TID Cond: (ctid < '(1,0)'::tid)
+ -> Aggregate
+ -> Tid Range Scan on tenk1 t2
+ TID Cond: (ctid <= t.ctid)
+(9 rows)
+
rollback;
diff --git a/src/test/regress/sql/select_parallel.sql b/src/test/regress/sql/select_parallel.sql
index 20376c03fa..1d4ef68790 100644
--- a/src/test/regress/sql/select_parallel.sql
+++ b/src/test/regress/sql/select_parallel.sql
@@ -495,4 +495,19 @@ EXPLAIN (COSTS OFF)
SELECT 1 FROM tenk1_vw_sec
WHERE (SELECT sum(f1) FROM int4_tbl WHERE f1 < unique1) < 100;
+-- test parallel tid range scan
+EXPLAIN (COSTS OFF)
+SELECT ctid FROM tenk1 WHERE ctid > '(0,1)' LIMIT 1;
+
+EXPLAIN (COSTS OFF)
+SELECT ctid FROM tenk1 WHERE ctid < '(400,1)' LIMIT 1;
+
+EXPLAIN (COSTS OFF)
+SELECT ctid FROM tenk1 WHERE ctid >= '(0,1)' AND ctid <= '(400,1)' LIMIT 1;
+
+EXPLAIN (COSTS OFF)
+SELECT t.ctid,t2.c FROM tenk1 t,
+LATERAL (SELECT count(*) c FROM tenk1 t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)' LIMIT 1;
+
rollback;
--
2.17.1
On Thu, 9 May 2024 at 10:23, Cary Huang <cary.huang@highgo.ca> wrote:
> The v3 patch is attached.
I've not looked at the patch, but please add it to the July CF. I'll
try and look in more detail then.
David
> I've not looked at the patch, but please add it to the July CF. I'll
> try and look in more detail then.
Thanks David, I have added this patch to the July commitfest under the
server features category.
I understand that the regression tests for parallel ctid range scan are
a bit lacking right now. They consist of a few EXPLAIN statements that
ensure parallel workers are used when tid ranges are specified, added as
part of the select_parallel.sql test. I am not sure whether it would be
more appropriate to have them in the tidrangescan.sql test instead,
essentially re-running the same test cases from tidrangescan.sql but in
parallel?
Thank you
Cary
On Fri, 10 May 2024 at 05:16, Cary Huang <cary.huang@highgo.ca> wrote:
> I understand that the regression tests for parallel ctid range scan is a
> bit lacking now. It only has a few EXPLAIN clauses to ensure parallel
> workers are used when tid ranges are specified. They are added as
> part of select_parallel.sql test. I am not sure if it is more appropriate
> to have them as part of tidrangescan.sql test instead. So basically
> re-run the same test cases in tidrangescan.sql but in parallel?
I think tidrangescan.sql is a more suitable location than
select_parallel.sql I don't think you need to repeat all the tests as
many of them are testing the tid qual processing which is the same
code as it is in the parallel version.
You should add a test that creates a table with a very low fillfactor,
low enough so only 1 tuple can fit on each page and insert a few dozen
tuples. The test would do SELECT COUNT(*) to ensure you find the
correct subset of tuples. You'd maybe want MIN(ctid) and MAX(ctid) in
there too for extra coverage to ensure that the correct tuples are
being counted. Just make sure and EXPLAIN (COSTS OFF) the query first
in the test to ensure that it's always testing the plan you're
expecting to test.
David
> You should add a test that creates a table with a very low fillfactor,
> low enough so only 1 tuple can fit on each page and insert a few dozen
> tuples. The test would do SELECT COUNT(*) to ensure you find the
> correct subset of tuples. You'd maybe want MIN(ctid) and MAX(ctid) in
> there too for extra coverage to ensure that the correct tuples are
> being counted. Just make sure and EXPLAIN (COSTS OFF) the query first
> in the test to ensure that it's always testing the plan you're
> expecting to test.
Thank you for the test suggestion. I have moved the regression tests from
select_parallel.sql to tidrangescan.sql. Following the existing test table
creation in tidrangescan.sql with the lowest fillfactor of 10, I get a
consistent 5 tuples per page rather than 1. That should be fine as long as
it stays consistently 5 tuples per page, so the tuple counts from the
parallel tests will be multiples of 5.
The attached v4 patch includes the improved regression tests.
Thank you very much!
Cary Huang
-------------
HighGo Software Inc. (Canada)
cary.huang@highgo.ca
www.highgo.ca
Attachments:
v4-0001-add-parallel-tid-rangescan.patch (application/octet-stream)
From a77fd44d81482d48ccf0ad7b251b69266a00884d Mon Sep 17 00:00:00 2001
From: Cary Huang <cary.huang@highgo.ca>
Date: Tue, 14 May 2024 14:14:34 -0700
Subject: [PATCH] v4 parallel tid range scan: improve regression tests
---
src/backend/access/heap/heapam.c | 14 +++
src/backend/access/table/tableam.c | 31 +++++-
src/backend/executor/execParallel.c | 20 ++++
src/backend/executor/nodeTidrangescan.c | 81 ++++++++++++++++
src/backend/optimizer/path/costsize.c | 15 +++
src/backend/optimizer/path/tidpath.c | 18 +++-
src/backend/optimizer/util/pathnode.c | 7 +-
src/include/access/relscan.h | 1 +
src/include/access/tableam.h | 10 ++
src/include/executor/nodeTidrangescan.h | 7 ++
src/include/nodes/execnodes.h | 1 +
src/include/optimizer/pathnode.h | 3 +-
src/test/regress/expected/tidrangescan.out | 106 +++++++++++++++++++++
src/test/regress/sql/tidrangescan.sql | 45 +++++++++
14 files changed, 353 insertions(+), 6 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 4be0dee4de..a696cba458 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -307,6 +307,7 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
* results for a non-MVCC snapshot, the caller must hold some higher-level
* lock that ensures the interesting tuple(s) won't change.)
*/
+
if (scan->rs_base.rs_parallel != NULL)
{
bpscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
@@ -1391,6 +1392,19 @@ heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
/* Set the start block and number of blocks to scan */
heap_setscanlimits(sscan, startBlk, numBlks);
+ /*
+ * if parallel mode is used, store startblock and numblocks in parallel
+ * scan descriptor as well
+ */
+ if (scan->rs_base.rs_parallel != NULL)
+ {
+ ParallelBlockTableScanDesc bpscan = NULL;
+
+ bpscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
+ bpscan->phs_startblock = scan->rs_startblock;
+ bpscan->phs_numblock = scan->rs_numblocks;
+ }
+
/* Finally, set the TID range in sscan */
ItemPointerCopy(&lowestItem, &sscan->rs_mintid);
ItemPointerCopy(&highestItem, &sscan->rs_maxtid);
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index e57a0b7ea3..61959d2b7d 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -187,6 +187,34 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
pscan, flags);
}
+TableScanDesc
+table_beginscan_parallel_tidrange(Relation relation, ParallelTableScanDesc pscan)
+{
+ Snapshot snapshot;
+ uint32 flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
+ TableScanDesc sscan;
+
+ Assert(RelationGetRelid(relation) == pscan->phs_relid);
+
+ if (!pscan->phs_snapshot_any)
+ {
+ /* Snapshot was serialized -- restore it */
+ snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
+ RegisterSnapshot(snapshot);
+ flags |= SO_TEMP_SNAPSHOT;
+ }
+ else
+ {
+ /* SnapshotAny passed by caller (not serialized) */
+ snapshot = SnapshotAny;
+ }
+
+ sscan = relation->rd_tableam->scan_begin(relation, snapshot, 0, NULL,
+ pscan, flags);
+
+ return sscan;
+}
+
/* ----------------------------------------------------------------------------
* Index scan related functions.
@@ -576,7 +604,8 @@ table_block_parallelscan_nextpage(Relation rel,
pbscanwork->phsw_chunk_remaining = pbscanwork->phsw_chunk_size - 1;
}
- if (nallocated >= pbscan->phs_nblocks)
+ if (nallocated >= pbscan->phs_nblocks || (pbscan->phs_numblock != 0 &&
+ nallocated >= pbscan->phs_numblock))
page = InvalidBlockNumber; /* all blocks have been allocated */
else
page = (nallocated + pbscan->phs_startblock) % pbscan->phs_nblocks;
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 8c53d1834e..e4733ca5a3 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -40,6 +40,7 @@
#include "executor/nodeSort.h"
#include "executor/nodeSubplan.h"
#include "executor/tqueue.h"
+#include "executor/nodeTidrangescan.h"
#include "jit/jit.h"
#include "nodes/nodeFuncs.h"
#include "pgstat.h"
@@ -296,6 +297,11 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecMemoizeEstimate((MemoizeState *) planstate, e->pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanEstimate((TidRangeScanState *) planstate,
+ e->pcxt);
+ break;
default:
break;
}
@@ -520,6 +526,11 @@ ExecParallelInitializeDSM(PlanState *planstate,
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecMemoizeInitializeDSM((MemoizeState *) planstate, d->pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanInitializeDSM((TidRangeScanState *) planstate,
+ d->pcxt);
+ break;
default:
break;
}
@@ -1006,6 +1017,11 @@ ExecParallelReInitializeDSM(PlanState *planstate,
case T_MemoizeState:
/* these nodes have DSM state, but no reinitialization is required */
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanReInitializeDSM((TidRangeScanState *) planstate,
+ pcxt);
+ break;
default:
break;
@@ -1372,6 +1388,10 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecMemoizeInitializeWorker((MemoizeState *) planstate, pwcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanInitializeWorker((TidRangeScanState *) planstate, pwcxt);
+ break;
default:
break;
}
diff --git a/src/backend/executor/nodeTidrangescan.c b/src/backend/executor/nodeTidrangescan.c
index 9aa7683d7e..b1553b990d 100644
--- a/src/backend/executor/nodeTidrangescan.c
+++ b/src/backend/executor/nodeTidrangescan.c
@@ -403,3 +403,84 @@ ExecInitTidRangeScan(TidRangeScan *node, EState *estate, int eflags)
*/
return tidrangestate;
}
+/* ----------------------------------------------------------------
+ * Parallel Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanEstimate
+ *
+ * Compute the amount of space we'll need in the parallel
+ * query DSM, and inform pcxt->estimator about our needs.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanEstimate(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+
+ node->pscan_len = table_parallelscan_estimate(node->ss.ss_currentRelation,
+ estate->es_snapshot);
+ shm_toc_estimate_chunk(&pcxt->estimator, node->pscan_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanInitializeDSM
+ *
+ * Set up a parallel heap scan descriptor.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeDSM(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+ ParallelTableScanDesc pscan;
+
+ pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
+ table_parallelscan_initialize(node->ss.ss_currentRelation,
+ pscan,
+ estate->es_snapshot);
+ /* disable syncscan in parallel tid range scan */
+ pscan->phs_syncscan = false;
+ shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan);
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanReInitializeDSM
+ *
+ * Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanReInitializeDSM(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ ParallelTableScanDesc pscan;
+
+ pscan = node->ss.ss_currentScanDesc->rs_parallel;
+ table_parallelscan_reinitialize(node->ss.ss_currentRelation, pscan);
+
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanInitializeWorker
+ *
+ * Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeWorker(TidRangeScanState *node,
+ ParallelWorkerContext *pwcxt)
+{
+ ParallelTableScanDesc pscan;
+
+ pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan);
+}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index ee23ed7835..fee603d048 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1435,6 +1435,21 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
startup_cost += path->pathtarget->cost.startup;
run_cost += path->pathtarget->cost.per_tuple * path->rows;
+ /* Adjust costing for parallelism, if used. */
+ if (path->parallel_workers > 0)
+ {
+ double parallel_divisor = get_parallel_divisor(path);
+
+ /* The CPU cost is divided among all the workers. */
+ run_cost /= parallel_divisor;
+
+ /*
+ * In the case of a parallel plan, the row count needs to represent
+ * the number of tuples processed per worker.
+ */
+ path->rows = clamp_row_est(path->rows / parallel_divisor);
+ }
+
path->startup_cost = startup_cost;
path->total_cost = startup_cost + run_cost;
}
diff --git a/src/backend/optimizer/path/tidpath.c b/src/backend/optimizer/path/tidpath.c
index 2ae5ddfe43..3c52ef911e 100644
--- a/src/backend/optimizer/path/tidpath.c
+++ b/src/backend/optimizer/path/tidpath.c
@@ -46,6 +46,7 @@
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/restrictinfo.h"
+#include "optimizer/cost.h"
/*
@@ -496,7 +497,22 @@ create_tidscan_paths(PlannerInfo *root, RelOptInfo *rel)
add_path(rel, (Path *) create_tidrangescan_path(root, rel,
tidrangequals,
- required_outer));
+ required_outer,
+ 0));
+
+ /* If appropriate, consider parallel tid range scan */
+ if (rel->consider_parallel && required_outer == NULL)
+ {
+ int parallel_workers;
+
+ parallel_workers = compute_parallel_worker(rel, rel->pages, -1,
+ max_parallel_workers_per_gather);
+ if (parallel_workers > 0)
+ {
+ add_partial_path(rel, (Path *) create_tidrangescan_path(root, rel, tidrangequals,
+ required_outer, parallel_workers));
+ }
+ }
}
/*
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 3cf1dac087..7ceeaf8688 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1206,7 +1206,8 @@ create_tidscan_path(PlannerInfo *root, RelOptInfo *rel, List *tidquals,
*/
TidRangePath *
create_tidrangescan_path(PlannerInfo *root, RelOptInfo *rel,
- List *tidrangequals, Relids required_outer)
+ List *tidrangequals, Relids required_outer,
+ int parallel_workers)
{
TidRangePath *pathnode = makeNode(TidRangePath);
@@ -1215,9 +1216,9 @@ create_tidrangescan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->path.pathtarget = rel->reltarget;
pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
required_outer);
- pathnode->path.parallel_aware = false;
+ pathnode->path.parallel_aware = (parallel_workers > 0);
pathnode->path.parallel_safe = rel->consider_parallel;
- pathnode->path.parallel_workers = 0;
+ pathnode->path.parallel_workers = parallel_workers;
pathnode->path.pathkeys = NIL; /* always unordered */
pathnode->tidrangequals = tidrangequals;
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 521043304a..9bede2d6e6 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -81,6 +81,7 @@ typedef struct ParallelBlockTableScanDescData
BlockNumber phs_startblock; /* starting block number */
pg_atomic_uint64 phs_nallocated; /* number of blocks allocated to
* workers so far. */
+ BlockNumber phs_numblock; /* max number of blocks to scan */
} ParallelBlockTableScanDescData;
typedef struct ParallelBlockTableScanDescData *ParallelBlockTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8e583b45cd..2cffd813a5 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1175,6 +1175,16 @@ extern void table_parallelscan_initialize(Relation rel,
extern TableScanDesc table_beginscan_parallel(Relation relation,
ParallelTableScanDesc pscan);
+/*
+ * Begin a parallel tidrange scan. `pscan` needs to have been initialized with
+ * table_parallelscan_initialize(), for the same relation. The initialization
+ * does not need to have happened in this backend.
+ *
+ * Caller must hold a suitable lock on the relation.
+ */
+extern TableScanDesc table_beginscan_parallel_tidrange(Relation relation,
+ ParallelTableScanDesc pscan);
+
/*
* Restart a parallel scan. Call this in the leader process. Caller is
* responsible for making sure that all workers have finished the scan
diff --git a/src/include/executor/nodeTidrangescan.h b/src/include/executor/nodeTidrangescan.h
index 1cfc7a07be..977cb8eb6e 100644
--- a/src/include/executor/nodeTidrangescan.h
+++ b/src/include/executor/nodeTidrangescan.h
@@ -14,6 +14,7 @@
#ifndef NODETIDRANGESCAN_H
#define NODETIDRANGESCAN_H
+#include "access/parallel.h"
#include "nodes/execnodes.h"
extern TidRangeScanState *ExecInitTidRangeScan(TidRangeScan *node,
@@ -21,4 +22,10 @@ extern TidRangeScanState *ExecInitTidRangeScan(TidRangeScan *node,
extern void ExecEndTidRangeScan(TidRangeScanState *node);
extern void ExecReScanTidRangeScan(TidRangeScanState *node);
+/* parallel scan support */
+extern void ExecTidRangeScanEstimate(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanReInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanInitializeWorker(TidRangeScanState *node, ParallelWorkerContext *pwcxt);
+
#endif /* NODETIDRANGESCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index d927ac44a8..81eec34730 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1862,6 +1862,7 @@ typedef struct TidRangeScanState
ItemPointerData trss_mintid;
ItemPointerData trss_maxtid;
bool trss_inScan;
+ Size pscan_len; /* size of parallel tid range scan descriptor */
} TidRangeScanState;
/* ----------------
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index c5c4756b0f..d7683ec1c3 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -66,7 +66,8 @@ extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
extern TidRangePath *create_tidrangescan_path(PlannerInfo *root,
RelOptInfo *rel,
List *tidrangequals,
- Relids required_outer);
+ Relids required_outer,
+ int parallel_workers);
extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
List *subpaths, List *partial_subpaths,
List *pathkeys, Relids required_outer,
diff --git a/src/test/regress/expected/tidrangescan.out b/src/test/regress/expected/tidrangescan.out
index 721f3b94e0..7d48b430d3 100644
--- a/src/test/regress/expected/tidrangescan.out
+++ b/src/test/regress/expected/tidrangescan.out
@@ -297,4 +297,110 @@ FETCH LAST c;
COMMIT;
DROP TABLE tidrangescan;
+-- tests for parallel tidrangescans
+SET parallel_setup_cost=0;
+SET parallel_tuple_cost=0;
+SET min_parallel_table_scan_size=0;
+SET max_parallel_workers_per_gather=4;
+CREATE TABLE parallel_tidrangescan(id integer, data text) with (fillfactor=10);
+-- insert enough tuples such that each page gets 5 tuples with fillfactor = 10
+INSERT INTO parallel_tidrangescan SELECT i,repeat('x', 100) FROM generate_series(1,200) AS s(i);
+-- ensure there are 40 pages for parallel test
+select min(ctid), max(ctid) from parallel_tidrangescan;
+ min | max
+-------+--------
+ (0,1) | (39,5)
+(1 row)
+
+-- parallel range scans with upper bound
+explain (costs off)
+select count(*) from parallel_tidrangescan where ctid<'(30,1)';
+ QUERY PLAN
+--------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: (ctid < '(30,1)'::tid)
+(6 rows)
+
+select count(*) from parallel_tidrangescan where ctid<'(30,1)';
+ count
+-------
+ 150
+(1 row)
+
+-- parallel range scans with lower bound
+explain (costs off)
+select count(*) from parallel_tidrangescan where ctid>'(10,0)';
+ QUERY PLAN
+--------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: (ctid > '(10,0)'::tid)
+(6 rows)
+
+select count(*) from parallel_tidrangescan where ctid>'(10,0)';
+ count
+-------
+ 150
+(1 row)
+
+-- parallel range scans with both bounds
+explain (costs off)
+select count(*) from parallel_tidrangescan where ctid>'(10,0)' and ctid<'(30,1)';
+ QUERY PLAN
+-----------------------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: ((ctid > '(10,0)'::tid) AND (ctid < '(30,1)'::tid))
+(6 rows)
+
+select count(*) from parallel_tidrangescan where ctid>'(10,0)' and ctid<'(30,1)';
+ count
+-------
+ 100
+(1 row)
+
+-- parallel rescans
+EXPLAIN (COSTS OFF)
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+ QUERY PLAN
+----------------------------------------------------------------
+ Nested Loop
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Tid Range Scan on parallel_tidrangescan t
+ TID Cond: (ctid < '(1,0)'::tid)
+ -> Aggregate
+ -> Tid Range Scan on parallel_tidrangescan t2
+ TID Cond: (ctid <= t.ctid)
+(8 rows)
+
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+ ctid | c
+-------+---
+ (0,1) | 1
+ (0,2) | 2
+ (0,3) | 3
+ (0,4) | 4
+ (0,5) | 5
+(5 rows)
+
+DROP TABLE parallel_tidrangescan;
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
RESET enable_seqscan;
diff --git a/src/test/regress/sql/tidrangescan.sql b/src/test/regress/sql/tidrangescan.sql
index ac09ebb626..5b9bf8efc7 100644
--- a/src/test/regress/sql/tidrangescan.sql
+++ b/src/test/regress/sql/tidrangescan.sql
@@ -98,4 +98,49 @@ COMMIT;
DROP TABLE tidrangescan;
+-- tests for parallel tidrangescans
+SET parallel_setup_cost=0;
+SET parallel_tuple_cost=0;
+SET min_parallel_table_scan_size=0;
+SET max_parallel_workers_per_gather=4;
+
+CREATE TABLE parallel_tidrangescan(id integer, data text) with (fillfactor=10);
+
+-- insert enough tuples such that each page gets 5 tuples with fillfactor = 10
+INSERT INTO parallel_tidrangescan SELECT i,repeat('x', 100) FROM generate_series(1,200) AS s(i);
+
+-- ensure there are 40 pages for parallel test
+select min(ctid), max(ctid) from parallel_tidrangescan;
+
+-- parallel range scans with upper bound
+explain (costs off)
+select count(*) from parallel_tidrangescan where ctid<'(30,1)';
+select count(*) from parallel_tidrangescan where ctid<'(30,1)';
+
+-- parallel range scans with lower bound
+explain (costs off)
+select count(*) from parallel_tidrangescan where ctid>'(10,0)';
+select count(*) from parallel_tidrangescan where ctid>'(10,0)';
+
+-- parallel range scans with both bounds
+explain (costs off)
+select count(*) from parallel_tidrangescan where ctid>'(10,0)' and ctid<'(30,1)';
+select count(*) from parallel_tidrangescan where ctid>'(10,0)' and ctid<'(30,1)';
+
+-- parallel rescans
+EXPLAIN (COSTS OFF)
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+
+DROP TABLE parallel_tidrangescan;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
RESET enable_seqscan;
--
2.17.1
Hi Cary,
On Wed, May 15, 2024 at 5:33 AM Cary Huang <cary.huang@highgo.ca> wrote:
You should add a test that creates a table with a very low fillfactor,
low enough so only 1 tuple can fit on each page and insert a few dozen
tuples. The test would do SELECT COUNT(*) to ensure you find the
correct subset of tuples. You'd maybe want MIN(ctid) and MAX(ctid) in
there too for extra coverage to ensure that the correct tuples are
being counted. Just make sure and EXPLAIN (COSTS OFF) the query first
in the test to ensure that it's always testing the plan you're
expecting to test.

Thank you for the test suggestion. I moved the regression tests from select_parallel
to tidrangescan instead. Following the existing test table creation in tidrangescan
with the lowest fillfactor of 10, I am able to get a consistent 5 tuples per page
instead of 1. That should be fine as long as it stays at 5 tuples per page, so
the tuple counts from the parallel tests come out in multiples of 5.

The attached v4 patch includes the improved regression tests.
Thank you very much!
Cary Huang
-------------
HighGo Software Inc. (Canada)
cary.huang@highgo.ca
www.highgo.ca
+++ b/src/backend/access/heap/heapam.c
@@ -307,6 +307,7 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
* results for a non-MVCC snapshot, the caller must hold some higher-level
* lock that ensures the interesting tuple(s) won't change.)
*/
+
I see no reason why you added a blank line here; is it a typo?
+/* ----------------------------------------------------------------
+ * ExecSeqScanInitializeWorker
+ *
+ * Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeWorker(TidRangeScanState *node,
+ ParallelWorkerContext *pwcxt)
+{
+ ParallelTableScanDesc pscan;
Function name in the comment is not consistent.
@@ -81,6 +81,7 @@ typedef struct ParallelBlockTableScanDescData
BlockNumber phs_startblock; /* starting block number */
pg_atomic_uint64 phs_nallocated; /* number of blocks allocated to
* workers so far. */
+ BlockNumber phs_numblock; /* max number of blocks to scan */
} ParallelBlockTableScanDescData;
Can this be reorganized by putting phs_numblock after phs_startblock?
--
Regards
Junwang Zhao
This is a good idea for extending parallelism in Postgres.
I went through this patch, and here are a few review comments,
+ Size pscan_len; /* size of parallel tid range scan descriptor */
An alternative name for this variable could be tidrs_PscanLen, following the pattern
in IndexScanState and IndexOnlyScanState.
Also add it and its description in the comment above the struct.
/* ----------------------------------------------------------------
* ExecTidRangeScanInitializeDSM
*
* Set up a parallel heap scan descriptor.
* ----------------------------------------------------------------
*/
This comment doesn't seem right, please correct it to say for Tid range
scan descriptor.
+
+ sscan = relation->rd_tableam->scan_begin(relation, snapshot, 0, NULL,
+ pscan, flags);
+
+ return sscan;
I do not see any need for this sscan variable.
- if (nallocated >= pbscan->phs_nblocks)
+ if (nallocated >= pbscan->phs_nblocks || (pbscan->phs_numblock != 0 &&
+ nallocated >= pbscan->phs_numblock))
page = InvalidBlockNumber; /* all blocks have been allocated */
Please add a comment explaining the reason for this change. As far as I
understand, this applies only to the TID range scan case, so that case deserves
an explanation.
On Sun, 11 Aug 2024 at 09:03, Junwang Zhao <zhjwpku@gmail.com> wrote:
--
Regards,
Rafia Sabih
Hello
Sorry David and all who have reviewed the patch, it's been a while since the patch
was last worked on :(. Thank you all for the reviews and comments! Attached is
the rebased patch that adds support for parallel TID range scans. This feature is
particularly useful for scanning large tables where the data needs to be read in
sizable segments, with a TID range in the WHERE clause defining each segment.
By enabling parallelism, this approach can improve performance compared to
both non-parallel TID range scans and traditional sequential scans.
Regards
Cary Huang
Attachments:
v5-0001-add-parallel-tid-rangescan.patchapplication/octet-stream; name=v5-0001-add-parallel-tid-rangescan.patchDownload
From 2e08f77ccab24fbba604b3af0be2674e68bc97e5 Mon Sep 17 00:00:00 2001
From: Cary Huang <cary.huang@highgo.ca>
Date: Thu, 5 Jun 2025 14:11:32 -0700
Subject: [PATCH] v5 parallel tid range scan: rebase to pg18 master and
addressed community comments
---
src/backend/access/heap/heapam.c | 13 +++
src/backend/access/table/tableam.c | 42 +++++++-
src/backend/executor/execParallel.c | 20 ++++
src/backend/executor/nodeTidrangescan.c | 81 ++++++++++++++++
src/backend/optimizer/path/costsize.c | 15 +++
src/backend/optimizer/path/tidpath.c | 18 +++-
src/backend/optimizer/util/pathnode.c | 7 +-
src/include/access/relscan.h | 1 +
src/include/access/tableam.h | 10 ++
src/include/executor/nodeTidrangescan.h | 7 ++
src/include/nodes/execnodes.h | 2 +
src/include/optimizer/pathnode.h | 3 +-
src/test/regress/expected/tidrangescan.out | 106 +++++++++++++++++++++
src/test/regress/sql/tidrangescan.sql | 45 +++++++++
14 files changed, 364 insertions(+), 6 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0dcd6ee817..f79d0331bd 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1478,6 +1478,19 @@ heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
/* Set the start block and number of blocks to scan */
heap_setscanlimits(sscan, startBlk, numBlks);
+ /*
+ * if parallel mode is used, store startblock and numblocks in parallel
+ * scan descriptor as well.
+ */
+ if (scan->rs_base.rs_parallel != NULL)
+ {
+ ParallelBlockTableScanDesc bpscan = NULL;
+
+ bpscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
+ bpscan->phs_startblock = scan->rs_startblock;
+ bpscan->phs_numblock = scan->rs_numblocks;
+ }
+
/* Finally, set the TID range in sscan */
ItemPointerCopy(&lowestItem, &sscan->st.tidrange.rs_mintid);
ItemPointerCopy(&highestItem, &sscan->st.tidrange.rs_maxtid);
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index a56c5eceb1..fd74901b07 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -188,6 +188,34 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
pscan, flags);
}
+TableScanDesc
+table_beginscan_parallel_tidrange(Relation relation, ParallelTableScanDesc pscan)
+{
+ Snapshot snapshot;
+ uint32 flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
+ TableScanDesc sscan;
+
+ Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
+
+ if (!pscan->phs_snapshot_any)
+ {
+ /* Snapshot was serialized -- restore it */
+ snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
+ RegisterSnapshot(snapshot);
+ flags |= SO_TEMP_SNAPSHOT;
+ }
+ else
+ {
+ /* SnapshotAny passed by caller (not serialized) */
+ snapshot = SnapshotAny;
+ }
+
+ sscan = relation->rd_tableam->scan_begin(relation, snapshot, 0, NULL,
+ pscan, flags);
+
+ return sscan;
+}
+
/* ----------------------------------------------------------------------------
* Index scan related functions.
@@ -577,7 +605,19 @@ table_block_parallelscan_nextpage(Relation rel,
pbscanwork->phsw_chunk_remaining = pbscanwork->phsw_chunk_size - 1;
}
- if (nallocated >= pbscan->phs_nblocks)
+ /*
+ * In a parallel TID range scan, 'pbscan->phs_numblock' will be non-zero
+ * that defines the upper limit on the number of blocks to scan based on
+ * the specified TID range. This value may be less than or equal to
+ * 'pbscan->phs_nblocks', which is the total number of blocks in the
+ * relation.
+ *
+ * The scan can terminate early once 'nallocated' reaches 'phs_numblock',
+ * even if the full relation has more blocks. This ensures that parallel
+ * workers only scan the subset of blocks that fall within the TID range.
+ */
+ if (nallocated >= pbscan->phs_nblocks || (pbscan->phs_numblock != 0 &&
+ nallocated >= pbscan->phs_numblock))
page = InvalidBlockNumber; /* all blocks have been allocated */
else
page = (nallocated + pbscan->phs_startblock) % pbscan->phs_nblocks;
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f3e77bda27..3b548cce79 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -41,6 +41,7 @@
#include "executor/nodeSort.h"
#include "executor/nodeSubplan.h"
#include "executor/tqueue.h"
+#include "executor/nodeTidrangescan.h"
#include "jit/jit.h"
#include "nodes/nodeFuncs.h"
#include "pgstat.h"
@@ -305,6 +306,11 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecMemoizeEstimate((MemoizeState *) planstate, e->pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanEstimate((TidRangeScanState *) planstate,
+ e->pcxt);
+ break;
default:
break;
}
@@ -532,6 +538,11 @@ ExecParallelInitializeDSM(PlanState *planstate,
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecMemoizeInitializeDSM((MemoizeState *) planstate, d->pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanInitializeDSM((TidRangeScanState *) planstate,
+ d->pcxt);
+ break;
default:
break;
}
@@ -1019,6 +1030,11 @@ ExecParallelReInitializeDSM(PlanState *planstate,
case T_MemoizeState:
/* these nodes have DSM state, but no reinitialization is required */
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanReInitializeDSM((TidRangeScanState *) planstate,
+ pcxt);
+ break;
default:
break;
@@ -1401,6 +1417,10 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecMemoizeInitializeWorker((MemoizeState *) planstate, pwcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanInitializeWorker((TidRangeScanState *) planstate, pwcxt);
+ break;
default:
break;
}
diff --git a/src/backend/executor/nodeTidrangescan.c b/src/backend/executor/nodeTidrangescan.c
index ab2eab9596..22dd88c3d4 100644
--- a/src/backend/executor/nodeTidrangescan.c
+++ b/src/backend/executor/nodeTidrangescan.c
@@ -403,3 +403,84 @@ ExecInitTidRangeScan(TidRangeScan *node, EState *estate, int eflags)
*/
return tidrangestate;
}
+/* ----------------------------------------------------------------
+ * Parallel Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanEstimate
+ *
+ * Compute the amount of space we'll need in the parallel
+ * query DSM, and inform pcxt->estimator about our needs.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanEstimate(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+
+ node->trss_pscanlen = table_parallelscan_estimate(node->ss.ss_currentRelation,
+ estate->es_snapshot);
+ shm_toc_estimate_chunk(&pcxt->estimator, node->trss_pscanlen);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanInitializeDSM
+ *
+ * Set up a parallel TID scan descriptor.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeDSM(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+ ParallelTableScanDesc pscan;
+
+ pscan = shm_toc_allocate(pcxt->toc, node->trss_pscanlen);
+ table_parallelscan_initialize(node->ss.ss_currentRelation,
+ pscan,
+ estate->es_snapshot);
+ /* disable syncscan in parallel tid range scan. */
+ pscan->phs_syncscan = false;
+ shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan);
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanReInitializeDSM
+ *
+ * Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanReInitializeDSM(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ ParallelTableScanDesc pscan;
+
+ pscan = node->ss.ss_currentScanDesc->rs_parallel;
+ table_parallelscan_reinitialize(node->ss.ss_currentRelation, pscan);
+
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanInitializeWorker
+ *
+ * Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeWorker(TidRangeScanState *node,
+ ParallelWorkerContext *pwcxt)
+{
+ ParallelTableScanDesc pscan;
+
+ pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan);
+}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 3d44815ed5..cebe161330 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1438,6 +1438,21 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
startup_cost += path->pathtarget->cost.startup;
run_cost += path->pathtarget->cost.per_tuple * path->rows;
+ /* Adjust costing for parallelism, if used. */
+ if (path->parallel_workers > 0)
+ {
+ double parallel_divisor = get_parallel_divisor(path);
+
+ /* The CPU cost is divided among all the workers. */
+ run_cost /= parallel_divisor;
+
+ /*
+ * In the case of a parallel plan, the row count needs to represent
+ * the number of tuples processed per worker.
+ */
+ path->rows = clamp_row_est(path->rows / parallel_divisor);
+ }
+
/* we should not generate this path type when enable_tidscan=false */
Assert(enable_tidscan);
path->disabled_nodes = 0;
diff --git a/src/backend/optimizer/path/tidpath.c b/src/backend/optimizer/path/tidpath.c
index 2bfb338b81..9c78eedcf5 100644
--- a/src/backend/optimizer/path/tidpath.c
+++ b/src/backend/optimizer/path/tidpath.c
@@ -47,6 +47,7 @@
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/restrictinfo.h"
+#include "optimizer/cost.h"
/*
@@ -553,7 +554,22 @@ create_tidscan_paths(PlannerInfo *root, RelOptInfo *rel)
add_path(rel, (Path *) create_tidrangescan_path(root, rel,
tidrangequals,
- required_outer));
+ required_outer,
+ 0));
+
+ /* If appropriate, consider parallel tid range scan. */
+ if (rel->consider_parallel && required_outer == NULL)
+ {
+ int parallel_workers;
+
+ parallel_workers = compute_parallel_worker(rel, rel->pages, -1,
+ max_parallel_workers_per_gather);
+ if (parallel_workers > 0)
+ {
+ add_partial_path(rel, (Path *) create_tidrangescan_path(root, rel, tidrangequals,
+ required_outer, parallel_workers));
+ }
+ }
}
/*
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index e0192d4a49..bfee57e5c1 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1262,7 +1262,8 @@ create_tidscan_path(PlannerInfo *root, RelOptInfo *rel, List *tidquals,
*/
TidRangePath *
create_tidrangescan_path(PlannerInfo *root, RelOptInfo *rel,
- List *tidrangequals, Relids required_outer)
+ List *tidrangequals, Relids required_outer,
+ int parallel_workers)
{
TidRangePath *pathnode = makeNode(TidRangePath);
@@ -1271,9 +1272,9 @@ create_tidrangescan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->path.pathtarget = rel->reltarget;
pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
required_outer);
- pathnode->path.parallel_aware = false;
+ pathnode->path.parallel_aware = (parallel_workers > 0);
pathnode->path.parallel_safe = rel->consider_parallel;
- pathnode->path.parallel_workers = 0;
+ pathnode->path.parallel_workers = parallel_workers;
pathnode->path.pathkeys = NIL; /* always unordered */
pathnode->tidrangequals = tidrangequals;
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b5e0fb386c..2e2c4c03b3 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -96,6 +96,7 @@ typedef struct ParallelBlockTableScanDescData
BlockNumber phs_nblocks; /* # blocks in relation at start of scan */
slock_t phs_mutex; /* mutual exclusion for setting startblock */
BlockNumber phs_startblock; /* starting block number */
+ BlockNumber phs_numblock; /* max number of blocks to scan */
pg_atomic_uint64 phs_nallocated; /* number of blocks allocated to
* workers so far. */
} ParallelBlockTableScanDescData;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8713e12cbf..2d0e655bd8 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1125,6 +1125,16 @@ extern void table_parallelscan_initialize(Relation rel,
extern TableScanDesc table_beginscan_parallel(Relation relation,
ParallelTableScanDesc pscan);
+/*
+ * Begin a parallel tidrange scan. `pscan` needs to have been initialized with
+ * table_parallelscan_initialize(), for the same relation. The initialization
+ * does not need to have happened in this backend.
+ *
+ * Caller must hold a suitable lock on the relation.
+ */
+extern TableScanDesc table_beginscan_parallel_tidrange(Relation relation,
+ ParallelTableScanDesc pscan);
+
/*
* Restart a parallel scan. Call this in the leader process. Caller is
* responsible for making sure that all workers have finished the scan
diff --git a/src/include/executor/nodeTidrangescan.h b/src/include/executor/nodeTidrangescan.h
index a831f1202c..2b5465b3ce 100644
--- a/src/include/executor/nodeTidrangescan.h
+++ b/src/include/executor/nodeTidrangescan.h
@@ -14,6 +14,7 @@
#ifndef NODETIDRANGESCAN_H
#define NODETIDRANGESCAN_H
+#include "access/parallel.h"
#include "nodes/execnodes.h"
extern TidRangeScanState *ExecInitTidRangeScan(TidRangeScan *node,
@@ -21,4 +22,10 @@ extern TidRangeScanState *ExecInitTidRangeScan(TidRangeScan *node,
extern void ExecEndTidRangeScan(TidRangeScanState *node);
extern void ExecReScanTidRangeScan(TidRangeScanState *node);
+/* parallel scan support */
+extern void ExecTidRangeScanEstimate(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanReInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanInitializeWorker(TidRangeScanState *node, ParallelWorkerContext *pwcxt);
+
#endif /* NODETIDRANGESCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 2492282213..a92dc7fbc2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1921,6 +1921,7 @@ typedef struct TidScanState
* trss_mintid the lowest TID in the scan range
* trss_maxtid the highest TID in the scan range
* trss_inScan is a scan currently in progress?
+ * trss_pscanlen size of parallel TID range scan descriptor
* ----------------
*/
typedef struct TidRangeScanState
@@ -1930,6 +1931,7 @@ typedef struct TidRangeScanState
ItemPointerData trss_mintid;
ItemPointerData trss_maxtid;
bool trss_inScan;
+ Size trss_pscanlen;
} TidRangeScanState;
/* ----------------
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 60dcdb77e4..4b8dbc2a90 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -67,7 +67,8 @@ extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
extern TidRangePath *create_tidrangescan_path(PlannerInfo *root,
RelOptInfo *rel,
List *tidrangequals,
- Relids required_outer);
+ Relids required_outer,
+ int parallel_workers);
extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
List *subpaths, List *partial_subpaths,
List *pathkeys, Relids required_outer,
diff --git a/src/test/regress/expected/tidrangescan.out b/src/test/regress/expected/tidrangescan.out
index 721f3b94e0..32cd2bd9f4 100644
--- a/src/test/regress/expected/tidrangescan.out
+++ b/src/test/regress/expected/tidrangescan.out
@@ -297,4 +297,110 @@ FETCH LAST c;
COMMIT;
DROP TABLE tidrangescan;
+-- tests for parallel tidrangescans
+SET parallel_setup_cost=0;
+SET parallel_tuple_cost=0;
+SET min_parallel_table_scan_size=0;
+SET max_parallel_workers_per_gather=4;
+CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor=10);
+-- insert enough tuples such that each page gets 5 tuples with fillfactor = 10
+INSERT INTO parallel_tidrangescan SELECT i,repeat('x', 100) FROM generate_series(1,200) AS s(i);
+-- ensure there are 40 pages for parallel test
+SELECT min(ctid), max(ctid) FROM parallel_tidrangescan;
+ min | max
+-------+--------
+ (0,1) | (39,5)
+(1 row)
+
+-- parallel range scans with upper bound
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
+ QUERY PLAN
+--------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: (ctid < '(30,1)'::tid)
+(6 rows)
+
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
+ count
+-------
+ 150
+(1 row)
+
+-- parallel range scans with lower bound
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
+ QUERY PLAN
+--------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: (ctid > '(10,0)'::tid)
+(6 rows)
+
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
+ count
+-------
+ 150
+(1 row)
+
+-- parallel range scans with both bounds
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)';
+ QUERY PLAN
+-----------------------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: ((ctid > '(10,0)'::tid) AND (ctid < '(30,1)'::tid))
+(6 rows)
+
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)';
+ count
+-------
+ 100
+(1 row)
+
+-- parallel rescans
+EXPLAIN (COSTS OFF)
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+ QUERY PLAN
+----------------------------------------------------------------
+ Nested Loop
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Tid Range Scan on parallel_tidrangescan t
+ TID Cond: (ctid < '(1,0)'::tid)
+ -> Aggregate
+ -> Tid Range Scan on parallel_tidrangescan t2
+ TID Cond: (ctid <= t.ctid)
+(8 rows)
+
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+ ctid | c
+-------+---
+ (0,1) | 1
+ (0,2) | 2
+ (0,3) | 3
+ (0,4) | 4
+ (0,5) | 5
+(5 rows)
+
+DROP TABLE parallel_tidrangescan;
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
RESET enable_seqscan;
diff --git a/src/test/regress/sql/tidrangescan.sql b/src/test/regress/sql/tidrangescan.sql
index ac09ebb626..1d18b8a61d 100644
--- a/src/test/regress/sql/tidrangescan.sql
+++ b/src/test/regress/sql/tidrangescan.sql
@@ -98,4 +98,49 @@ COMMIT;
DROP TABLE tidrangescan;
+-- tests for parallel tidrangescans
+SET parallel_setup_cost=0;
+SET parallel_tuple_cost=0;
+SET min_parallel_table_scan_size=0;
+SET max_parallel_workers_per_gather=4;
+
+CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor=10);
+
+-- insert enough tuples such that each page gets 5 tuples with fillfactor = 10
+INSERT INTO parallel_tidrangescan SELECT i,repeat('x', 100) FROM generate_series(1,200) AS s(i);
+
+-- ensure there are 40 pages for parallel test
+SELECT min(ctid), max(ctid) FROM parallel_tidrangescan;
+
+-- parallel range scans with upper bound
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
+
+-- parallel range scans with lower bound
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
+
+-- parallel range scans with both bounds
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)';
+
+-- parallel rescans
+EXPLAIN (COSTS OFF)
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+
+DROP TABLE parallel_tidrangescan;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
RESET enable_seqscan;
--
2.17.1
Hi Cary,
On Fri, Jun 6, 2025 at 5:24 AM Cary Huang <cary.huang@highgo.ca> wrote:
Hello
Sorry David and all who have reviewed the patch, it's been a while since the patch
was last worked on :(. Thank you all for the reviews and comments! Attached is
the rebased patch that adds support for parallel TID range scans. This feature is
particularly useful for scanning large tables where the data needs to be scanned in
sizable segments using a TID range in the WHERE clause to define each segment.
By enabling parallelism, this approach can improve performance compared to
both non-parallel TID range scans and traditional sequential scans.

Regards
Cary Huang
Thanks for updating the patch. I have a few comments on it.
+ /*
+ * if parallel mode is used, store startblock and numblocks in parallel
+ * scan descriptor as well.
+ */
+ if (scan->rs_base.rs_parallel != NULL)
+ {
+ ParallelBlockTableScanDesc bpscan = NULL;
+
+ bpscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
+ bpscan->phs_startblock = scan->rs_startblock;
+ bpscan->phs_numblock = scan->rs_numblocks;
+ }
It would be more intuitive and clearer to directly use startBlk and numBlks
to set these values. Since scan->rs_startblock and scan->rs_numblocks
are already set using these variables, using the same approach for bpscan
would make the code easier to understand.
Another nitty-gritty is that you might want to use a capital `If` in the
comments to maintain the same style.
+ if (nallocated >= pbscan->phs_nblocks || (pbscan->phs_numblock != 0 &&
+ nallocated >= pbscan->phs_numblock))
I'd suggest explicitly setting phs_numblock to InvalidBlockNumber in
table_block_parallelscan_initialize, and compare with InvalidBlockNumber
here.
--
Regards
Junwang Zhao
Hi Junwang
Thank you so much for the review!
+	/*
+	 * if parallel mode is used, store startblock and numblocks in parallel
+	 * scan descriptor as well.
+	 */
+	if (scan->rs_base.rs_parallel != NULL)
+	{
+		ParallelBlockTableScanDesc bpscan = NULL;
+
+		bpscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
+		bpscan->phs_startblock = scan->rs_startblock;
+		bpscan->phs_numblock = scan->rs_numblocks;
+	}

It would be more intuitive and clearer to directly use startBlk and numBlks
to set these values. Since scan->rs_startblock and scan->rs_numblocks
are already set using these variables, using the same approach for bpscan
would make the code easier to understand.

Another nitty-gritty is that you might want to use a capital `If` in the
comments to maintain the same style.
Agreed, made the adjustment in the attached patch.
+	if (nallocated >= pbscan->phs_nblocks || (pbscan->phs_numblock != 0 &&
+		nallocated >= pbscan->phs_numblock))

I'd suggest explicitly setting phs_numblock to InvalidBlockNumber in
table_block_parallelscan_initialize, and compare with InvalidBlockNumber
here.
Also agreed, phs_numblock should be initialized in
table_block_parallelscan_initialize just like all other parameters in parallel scan
context. You are right, it is much neater to use InvalidBlockNumber rather
than 0 to indicate if an upper bound has been specified in the TID range scan.
I have addressed your comment in the attached v6 patch. Thank you again for
the review.
Best regards
Cary Huang
Attachments:
v6-0001-add-parallel-tid-rangescan.patch (application/octet-stream)
From 1b2d1f5494ab4edfc6007e2de10cb9d0b5eb1fc1 Mon Sep 17 00:00:00 2001
From: Cary Huang <cary.huang@highgo.ca>
Date: Mon, 9 Jun 2025 16:00:38 -0700
Subject: [PATCH] v6 parallel TID range scan patch
---
src/backend/access/heap/heapam.c | 13 +++
src/backend/access/table/tableam.c | 43 ++++++++-
src/backend/executor/execParallel.c | 20 ++++
src/backend/executor/nodeTidrangescan.c | 81 ++++++++++++++++
src/backend/optimizer/path/costsize.c | 15 +++
src/backend/optimizer/path/tidpath.c | 18 +++-
src/backend/optimizer/util/pathnode.c | 7 +-
src/include/access/relscan.h | 1 +
src/include/access/tableam.h | 10 ++
src/include/executor/nodeTidrangescan.h | 7 ++
src/include/nodes/execnodes.h | 2 +
src/include/optimizer/pathnode.h | 3 +-
src/test/regress/expected/tidrangescan.out | 106 +++++++++++++++++++++
src/test/regress/sql/tidrangescan.sql | 45 +++++++++
14 files changed, 365 insertions(+), 6 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0dcd6ee817..5105a2c8ad 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1478,6 +1478,19 @@ heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
/* Set the start block and number of blocks to scan */
heap_setscanlimits(sscan, startBlk, numBlks);
+ /*
+ * If parallel mode is used, store startBlk and numBlks in parallel
+ * scan descriptor as well.
+ */
+ if (scan->rs_base.rs_parallel != NULL)
+ {
+ ParallelBlockTableScanDesc bpscan = NULL;
+
+ bpscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
+ bpscan->phs_startblock = startBlk;
+ bpscan->phs_numblock = numBlks;
+ }
+
/* Finally, set the TID range in sscan */
ItemPointerCopy(&lowestItem, &sscan->st.tidrange.rs_mintid);
ItemPointerCopy(&highestItem, &sscan->st.tidrange.rs_maxtid);
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index a56c5eceb1..8e2e2a23bc 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -188,6 +188,34 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
pscan, flags);
}
+TableScanDesc
+table_beginscan_parallel_tidrange(Relation relation, ParallelTableScanDesc pscan)
+{
+ Snapshot snapshot;
+ uint32 flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
+ TableScanDesc sscan;
+
+ Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
+
+ if (!pscan->phs_snapshot_any)
+ {
+ /* Snapshot was serialized -- restore it */
+ snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
+ RegisterSnapshot(snapshot);
+ flags |= SO_TEMP_SNAPSHOT;
+ }
+ else
+ {
+ /* SnapshotAny passed by caller (not serialized) */
+ snapshot = SnapshotAny;
+ }
+
+ sscan = relation->rd_tableam->scan_begin(relation, snapshot, 0, NULL,
+ pscan, flags);
+
+ return sscan;
+}
+
/* ----------------------------------------------------------------------------
* Index scan related functions.
@@ -398,6 +426,7 @@ table_block_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan)
bpscan->phs_nblocks > NBuffers / 4;
SpinLockInit(&bpscan->phs_mutex);
bpscan->phs_startblock = InvalidBlockNumber;
+ bpscan->phs_numblock = InvalidBlockNumber;
pg_atomic_init_u64(&bpscan->phs_nallocated, 0);
return sizeof(ParallelBlockTableScanDescData);
@@ -577,7 +606,19 @@ table_block_parallelscan_nextpage(Relation rel,
pbscanwork->phsw_chunk_remaining = pbscanwork->phsw_chunk_size - 1;
}
- if (nallocated >= pbscan->phs_nblocks)
+ /*
+	 * In a parallel TID range scan, 'pbscan->phs_numblock' is the number of
+	 * blocks to scan if an upper TID range limit is given, otherwise it is
+	 * InvalidBlockNumber. This value may be less than or equal to
+	 * 'pbscan->phs_nblocks', the total number of blocks in the relation.
+ *
+ * The scan can terminate early once 'nallocated' reaches 'phs_numblock',
+ * even if the full relation has remaining blocks to scan. This ensures
+ * that parallel workers only scan the subset of blocks that fall within
+ * the TID range.
+ */
+ if (nallocated >= pbscan->phs_nblocks || (pbscan->phs_numblock !=
+ InvalidBlockNumber && nallocated >= pbscan->phs_numblock))
page = InvalidBlockNumber; /* all blocks have been allocated */
else
page = (nallocated + pbscan->phs_startblock) % pbscan->phs_nblocks;
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f3e77bda27..3b548cce79 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -41,6 +41,7 @@
#include "executor/nodeSort.h"
#include "executor/nodeSubplan.h"
#include "executor/tqueue.h"
+#include "executor/nodeTidrangescan.h"
#include "jit/jit.h"
#include "nodes/nodeFuncs.h"
#include "pgstat.h"
@@ -305,6 +306,11 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecMemoizeEstimate((MemoizeState *) planstate, e->pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanEstimate((TidRangeScanState *) planstate,
+ e->pcxt);
+ break;
default:
break;
}
@@ -532,6 +538,11 @@ ExecParallelInitializeDSM(PlanState *planstate,
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecMemoizeInitializeDSM((MemoizeState *) planstate, d->pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanInitializeDSM((TidRangeScanState *) planstate,
+ d->pcxt);
+ break;
default:
break;
}
@@ -1019,6 +1030,11 @@ ExecParallelReInitializeDSM(PlanState *planstate,
case T_MemoizeState:
/* these nodes have DSM state, but no reinitialization is required */
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanReInitializeDSM((TidRangeScanState *) planstate,
+ pcxt);
+ break;
default:
break;
@@ -1401,6 +1417,10 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
/* even when not parallel-aware, for EXPLAIN ANALYZE */
ExecMemoizeInitializeWorker((MemoizeState *) planstate, pwcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanInitializeWorker((TidRangeScanState *) planstate, pwcxt);
+ break;
default:
break;
}
diff --git a/src/backend/executor/nodeTidrangescan.c b/src/backend/executor/nodeTidrangescan.c
index ab2eab9596..22dd88c3d4 100644
--- a/src/backend/executor/nodeTidrangescan.c
+++ b/src/backend/executor/nodeTidrangescan.c
@@ -403,3 +403,84 @@ ExecInitTidRangeScan(TidRangeScan *node, EState *estate, int eflags)
*/
return tidrangestate;
}
+/* ----------------------------------------------------------------
+ * Parallel Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanEstimate
+ *
+ * Compute the amount of space we'll need in the parallel
+ * query DSM, and inform pcxt->estimator about our needs.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanEstimate(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+
+ node->trss_pscanlen = table_parallelscan_estimate(node->ss.ss_currentRelation,
+ estate->es_snapshot);
+ shm_toc_estimate_chunk(&pcxt->estimator, node->trss_pscanlen);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanInitializeDSM
+ *
+ * Set up a parallel TID scan descriptor.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeDSM(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+ ParallelTableScanDesc pscan;
+
+ pscan = shm_toc_allocate(pcxt->toc, node->trss_pscanlen);
+ table_parallelscan_initialize(node->ss.ss_currentRelation,
+ pscan,
+ estate->es_snapshot);
+ /* disable syncscan in parallel tid range scan. */
+ pscan->phs_syncscan = false;
+ shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan);
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanReInitializeDSM
+ *
+ * Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanReInitializeDSM(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ ParallelTableScanDesc pscan;
+
+ pscan = node->ss.ss_currentScanDesc->rs_parallel;
+ table_parallelscan_reinitialize(node->ss.ss_currentRelation, pscan);
+
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanInitializeWorker
+ *
+ * Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeWorker(TidRangeScanState *node,
+ ParallelWorkerContext *pwcxt)
+{
+ ParallelTableScanDesc pscan;
+
+ pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan);
+}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 3d44815ed5..cebe161330 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1438,6 +1438,21 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
startup_cost += path->pathtarget->cost.startup;
run_cost += path->pathtarget->cost.per_tuple * path->rows;
+ /* Adjust costing for parallelism, if used. */
+ if (path->parallel_workers > 0)
+ {
+ double parallel_divisor = get_parallel_divisor(path);
+
+ /* The CPU cost is divided among all the workers. */
+ run_cost /= parallel_divisor;
+
+ /*
+ * In the case of a parallel plan, the row count needs to represent
+ * the number of tuples processed per worker.
+ */
+ path->rows = clamp_row_est(path->rows / parallel_divisor);
+ }
+
/* we should not generate this path type when enable_tidscan=false */
Assert(enable_tidscan);
path->disabled_nodes = 0;
diff --git a/src/backend/optimizer/path/tidpath.c b/src/backend/optimizer/path/tidpath.c
index 2bfb338b81..9c78eedcf5 100644
--- a/src/backend/optimizer/path/tidpath.c
+++ b/src/backend/optimizer/path/tidpath.c
@@ -47,6 +47,7 @@
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/restrictinfo.h"
+#include "optimizer/cost.h"
/*
@@ -553,7 +554,22 @@ create_tidscan_paths(PlannerInfo *root, RelOptInfo *rel)
add_path(rel, (Path *) create_tidrangescan_path(root, rel,
tidrangequals,
- required_outer));
+ required_outer,
+ 0));
+
+ /* If appropriate, consider parallel tid range scan. */
+ if (rel->consider_parallel && required_outer == NULL)
+ {
+ int parallel_workers;
+
+ parallel_workers = compute_parallel_worker(rel, rel->pages, -1,
+ max_parallel_workers_per_gather);
+ if (parallel_workers > 0)
+ {
+ add_partial_path(rel, (Path *) create_tidrangescan_path(root, rel, tidrangequals,
+ required_outer, parallel_workers));
+ }
+ }
}
/*
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index e0192d4a49..bfee57e5c1 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1262,7 +1262,8 @@ create_tidscan_path(PlannerInfo *root, RelOptInfo *rel, List *tidquals,
*/
TidRangePath *
create_tidrangescan_path(PlannerInfo *root, RelOptInfo *rel,
- List *tidrangequals, Relids required_outer)
+ List *tidrangequals, Relids required_outer,
+ int parallel_workers)
{
TidRangePath *pathnode = makeNode(TidRangePath);
@@ -1271,9 +1272,9 @@ create_tidrangescan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->path.pathtarget = rel->reltarget;
pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
required_outer);
- pathnode->path.parallel_aware = false;
+ pathnode->path.parallel_aware = (parallel_workers > 0);
pathnode->path.parallel_safe = rel->consider_parallel;
- pathnode->path.parallel_workers = 0;
+ pathnode->path.parallel_workers = parallel_workers;
pathnode->path.pathkeys = NIL; /* always unordered */
pathnode->tidrangequals = tidrangequals;
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b5e0fb386c..2e2c4c03b3 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -96,6 +96,7 @@ typedef struct ParallelBlockTableScanDescData
BlockNumber phs_nblocks; /* # blocks in relation at start of scan */
slock_t phs_mutex; /* mutual exclusion for setting startblock */
BlockNumber phs_startblock; /* starting block number */
+ BlockNumber phs_numblock; /* max number of blocks to scan */
pg_atomic_uint64 phs_nallocated; /* number of blocks allocated to
* workers so far. */
} ParallelBlockTableScanDescData;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8713e12cbf..2d0e655bd8 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1125,6 +1125,16 @@ extern void table_parallelscan_initialize(Relation rel,
extern TableScanDesc table_beginscan_parallel(Relation relation,
ParallelTableScanDesc pscan);
+/*
+ * Begin a parallel tidrange scan. `pscan` needs to have been initialized with
+ * table_parallelscan_initialize(), for the same relation. The initialization
+ * does not need to have happened in this backend.
+ *
+ * Caller must hold a suitable lock on the relation.
+ */
+extern TableScanDesc table_beginscan_parallel_tidrange(Relation relation,
+ ParallelTableScanDesc pscan);
+
/*
* Restart a parallel scan. Call this in the leader process. Caller is
* responsible for making sure that all workers have finished the scan
diff --git a/src/include/executor/nodeTidrangescan.h b/src/include/executor/nodeTidrangescan.h
index a831f1202c..2b5465b3ce 100644
--- a/src/include/executor/nodeTidrangescan.h
+++ b/src/include/executor/nodeTidrangescan.h
@@ -14,6 +14,7 @@
#ifndef NODETIDRANGESCAN_H
#define NODETIDRANGESCAN_H
+#include "access/parallel.h"
#include "nodes/execnodes.h"
extern TidRangeScanState *ExecInitTidRangeScan(TidRangeScan *node,
@@ -21,4 +22,10 @@ extern TidRangeScanState *ExecInitTidRangeScan(TidRangeScan *node,
extern void ExecEndTidRangeScan(TidRangeScanState *node);
extern void ExecReScanTidRangeScan(TidRangeScanState *node);
+/* parallel scan support */
+extern void ExecTidRangeScanEstimate(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanReInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanInitializeWorker(TidRangeScanState *node, ParallelWorkerContext *pwcxt);
+
#endif /* NODETIDRANGESCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 2492282213..a92dc7fbc2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1921,6 +1921,7 @@ typedef struct TidScanState
* trss_mintid the lowest TID in the scan range
* trss_maxtid the highest TID in the scan range
* trss_inScan is a scan currently in progress?
+ * trss_pscanlen size of parallel TID range scan descriptor
* ----------------
*/
typedef struct TidRangeScanState
@@ -1930,6 +1931,7 @@ typedef struct TidRangeScanState
ItemPointerData trss_mintid;
ItemPointerData trss_maxtid;
bool trss_inScan;
+ Size trss_pscanlen;
} TidRangeScanState;
/* ----------------
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 60dcdb77e4..4b8dbc2a90 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -67,7 +67,8 @@ extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
extern TidRangePath *create_tidrangescan_path(PlannerInfo *root,
RelOptInfo *rel,
List *tidrangequals,
- Relids required_outer);
+ Relids required_outer,
+ int parallel_workers);
extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
List *subpaths, List *partial_subpaths,
List *pathkeys, Relids required_outer,
diff --git a/src/test/regress/expected/tidrangescan.out b/src/test/regress/expected/tidrangescan.out
index 721f3b94e0..32cd2bd9f4 100644
--- a/src/test/regress/expected/tidrangescan.out
+++ b/src/test/regress/expected/tidrangescan.out
@@ -297,4 +297,110 @@ FETCH LAST c;
COMMIT;
DROP TABLE tidrangescan;
+-- tests for parallel tidrangescans
+SET parallel_setup_cost=0;
+SET parallel_tuple_cost=0;
+SET min_parallel_table_scan_size=0;
+SET max_parallel_workers_per_gather=4;
+CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor=10);
+-- insert enough tuples such that each page gets 5 tuples with fillfactor = 10
+INSERT INTO parallel_tidrangescan SELECT i,repeat('x', 100) FROM generate_series(1,200) AS s(i);
+-- ensure there are 40 pages for parallel test
+SELECT min(ctid), max(ctid) FROM parallel_tidrangescan;
+ min | max
+-------+--------
+ (0,1) | (39,5)
+(1 row)
+
+-- parallel range scans with upper bound
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
+ QUERY PLAN
+--------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: (ctid < '(30,1)'::tid)
+(6 rows)
+
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
+ count
+-------
+ 150
+(1 row)
+
+-- parallel range scans with lower bound
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
+ QUERY PLAN
+--------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: (ctid > '(10,0)'::tid)
+(6 rows)
+
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
+ count
+-------
+ 150
+(1 row)
+
+-- parallel range scans with both bounds
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)';
+ QUERY PLAN
+-----------------------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: ((ctid > '(10,0)'::tid) AND (ctid < '(30,1)'::tid))
+(6 rows)
+
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)';
+ count
+-------
+ 100
+(1 row)
+
+-- parallel rescans
+EXPLAIN (COSTS OFF)
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+ QUERY PLAN
+----------------------------------------------------------------
+ Nested Loop
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Tid Range Scan on parallel_tidrangescan t
+ TID Cond: (ctid < '(1,0)'::tid)
+ -> Aggregate
+ -> Tid Range Scan on parallel_tidrangescan t2
+ TID Cond: (ctid <= t.ctid)
+(8 rows)
+
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+ ctid | c
+-------+---
+ (0,1) | 1
+ (0,2) | 2
+ (0,3) | 3
+ (0,4) | 4
+ (0,5) | 5
+(5 rows)
+
+DROP TABLE parallel_tidrangescan;
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
RESET enable_seqscan;
diff --git a/src/test/regress/sql/tidrangescan.sql b/src/test/regress/sql/tidrangescan.sql
index ac09ebb626..1d18b8a61d 100644
--- a/src/test/regress/sql/tidrangescan.sql
+++ b/src/test/regress/sql/tidrangescan.sql
@@ -98,4 +98,49 @@ COMMIT;
DROP TABLE tidrangescan;
+-- tests for parallel tidrangescans
+SET parallel_setup_cost=0;
+SET parallel_tuple_cost=0;
+SET min_parallel_table_scan_size=0;
+SET max_parallel_workers_per_gather=4;
+
+CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor=10);
+
+-- insert enough tuples such that each page gets 5 tuples with fillfactor = 10
+INSERT INTO parallel_tidrangescan SELECT i,repeat('x', 100) FROM generate_series(1,200) AS s(i);
+
+-- ensure there are 40 pages for parallel test
+SELECT min(ctid), max(ctid) FROM parallel_tidrangescan;
+
+-- parallel range scans with upper bound
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
+
+-- parallel range scans with lower bound
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
+
+-- parallel range scans with both bounds
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)';
+
+-- parallel rescans
+EXPLAIN (COSTS OFF)
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+
+DROP TABLE parallel_tidrangescan;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
RESET enable_seqscan;
--
2.17.1
Hi, Cary,
I have two comments:
1. Does table_beginscan_parallel_tidrange() need an assert of relid,
like what table_beginscan_parallel() did?
Assert(RelationGetRelid(relation) == pscan->phs_relid);
2. The new field phs_numblock in ParallelBlockTableScanDescData
structure has almost the same name as another field phs_nblocks. Would
you consider changing it to another name, for example,
phs_maxnumblocktoscan?
Thanks,
Steven
On 2025/6/10 7:04, Cary Huang wrote:
Hi Steven
thanks for the review!
I have two comments:
1. Does table_beginscan_parallel_tidrange() need an assert of relid,
like what table_beginscan_parallel() did?
Assert(RelationGetRelid(relation) == pscan->phs_relid);
In the v6 rebased patch, the assert has become:
Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
rather than:
Assert(RelationGetRelid(relation) == pscan->phs_relid);
table_beginscan_parallel_tidrange() already has the proper assert line
similar to what table_beginscan_parallel() has.
2. The new field phs_numblock in ParallelBlockTableScanDescData
structure has almost the same name as another field phs_nblocks. Would
you consider changing it to another name, for example,
phs_maxnumblocktoscan?
I actually had a similar thought too, phs_nblocks and phs_numblock are
very similar but are quite different. But I still left the name as phs_numblock
because I want to keep it consistent (kind of) with the 'numBlks' used in
heap_set_tidrange() in heapam.c. The comments besides their declaration
should be enough to describe their differences without causing confusion.
Best regards
Cary
On Tue, 10 Jun 2025 at 11:04, Cary Huang <cary.huang@highgo.ca> wrote:
I have addressed your comment in the attached v6 patch. Thank you again for
the review.
Here's a review of v6:
1. In cost_tidrangescan() you're dividing the total costs by the
number of workers yet the comment is claiming that's CPU cost. I think
this needs to follow the lead of cost_seqscan() and separate out the
CPU and IO cost then add the IO cost at the end, after the divide by
the number of workers.
2. In execParallel.c, could you move the case for T_TidRangeScanState
below T_ForeignScanState? What you have right now is not quite
following the unwritten standard set out by the other nodes, i.e.
non-parallel aware nodes are last. A good spot seems to be putting it
at the end of the scan types... Custom scan seems slightly misplaced,
but probably can ignore that one and put it after T_ForeignScanState
3. The following comment should mention what behaviour occurs when the
field is set to InvalidBlockNumber:
BlockNumber phs_numblock; /* max number of blocks to scan */
Something like /* # of blocks to scan, or InvalidBlockNumber if no limit */
4. I think the following would be clearer if written using an else if:
if (nallocated >= pbscan->phs_nblocks || (pbscan->phs_numblock !=
InvalidBlockNumber && nallocated >= pbscan->phs_numblock))
page = InvalidBlockNumber; /* all blocks have been allocated */
else
page = (nallocated + pbscan->phs_startblock) % pbscan->phs_nblocks;
e.g:
if (nallocated >= pbscan->phs_nblocks)
page = InvalidBlockNumber; /* all blocks have been allocated */
else if (pbscan->phs_numblock != InvalidBlockNumber &&
nallocated >= pbscan->phs_numblock)
page = InvalidBlockNumber; /* upper scan limit reached */
else
page = (nallocated + pbscan->phs_startblock) % pbscan->phs_nblocks;
That way the comment after the assignment is accurate.
5. For the tests, is there any reason not to reuse the tidrangescan table?
I don't see any other issues, but I've not tested the patch yet. I'll
do that if you can fix the 5 above.
Thanks
David
Hi David
Thank you so much for the review! I have addressed the comments in the
attached v7 patch.
1. In cost_tidrangescan() you're dividing the total costs by the
number of workers yet the comment is claiming that's CPU cost. I think
this needs to follow the lead of cost_seqscan() and separate out the
CPU and IO cost then add the IO cost at the end, after the divide by
the number of workers.
I have separated the costs into disk and CPU costs similar to the style in
cost_seqscan().
2. In execParallel.c, could you move the case for T_TidRangeScanState
below T_ForeignScanState? What you have right now is not quite
following the unwritten standard set out by the other nodes, i.e.
non-parallel aware nodes are last. A good spot seems to be putting it
at the end of the scan types... Custom scan seems slightly misplaced,
but probably can ignore that one and put it after T_ForeignScanState
Yes, it's been done.
3. The following comment should mention what behaviour occurs when the
field is set to InvalidBlockNumber:
Also addressed
4. I think the following would be clearer if written using an else if:
if (nallocated >= pbscan->phs_nblocks || (pbscan->phs_numblock !=
InvalidBlockNumber && nallocated >= pbscan->phs_numblock))
page = InvalidBlockNumber; /* all blocks have been allocated */
else
page = (nallocated + pbscan->phs_startblock) % pbscan->phs_nblocks;

e.g:
if (nallocated >= pbscan->phs_nblocks)
page = InvalidBlockNumber; /* all blocks have been allocated */
else if (pbscan->phs_numblock != InvalidBlockNumber &&
nallocated >= pbscan->phs_numblock)
page = InvalidBlockNumber; /* upper scan limit reached */
else
page = (nallocated + pbscan->phs_startblock) % pbscan->phs_nblocks;

That way the comment after the assignment is accurate.
Agreed, and also addressed.
5. For the tests, is there any reason not to reuse the tidrangescan table?
To test TID range scan in parallel, I created a new table with a very low fill
factor so that more pages are created, each holding a small, fixed number of
tuples. The test then runs SELECT COUNT(*) to verify that the correct number
of tuples is counted by the parallel workers during the parallel TID range
scan. With this new table, it is easy to check the count because we know how
many tuples exist in each page and how many pages will be scanned given the
WHERE predicates. With the existing tidrangescan table, it would be hard to
tell whether the count is correct in my case.
thank you!
Cary.
Attachments:
v7-0001-add-parallel-tid-rangescan.patch (application/octet-stream)
From c0023b02e130906300409fbe1fff75874822642d Mon Sep 17 00:00:00 2001
From: Cary Huang <cary.huang@highgo.ca>
Date: Thu, 24 Jul 2025 11:19:14 -0700
Subject: [PATCH] v7 parallel TID range scan patch
---
src/backend/access/heap/heapam.c | 13 +++
src/backend/access/table/tableam.c | 45 ++++++++-
src/backend/executor/execParallel.c | 22 ++++-
src/backend/executor/nodeTidrangescan.c | 81 ++++++++++++++++
src/backend/optimizer/path/costsize.c | 34 ++++---
src/backend/optimizer/path/tidpath.c | 18 +++-
src/backend/optimizer/util/pathnode.c | 7 +-
src/include/access/relscan.h | 2 +
src/include/access/tableam.h | 10 ++
src/include/executor/nodeTidrangescan.h | 7 ++
src/include/nodes/execnodes.h | 2 +
src/include/optimizer/pathnode.h | 3 +-
src/test/regress/expected/tidrangescan.out | 106 +++++++++++++++++++++
src/test/regress/sql/tidrangescan.sql | 45 +++++++++
14 files changed, 377 insertions(+), 18 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0dcd6ee817..5105a2c8ad 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1478,6 +1478,19 @@ heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
/* Set the start block and number of blocks to scan */
heap_setscanlimits(sscan, startBlk, numBlks);
+ /*
+ * If parallel mode is used, store startBlk and numBlks in parallel
+ * scan descriptor as well.
+ */
+ if (scan->rs_base.rs_parallel != NULL)
+ {
+ ParallelBlockTableScanDesc bpscan = NULL;
+
+ bpscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
+ bpscan->phs_startblock = startBlk;
+ bpscan->phs_numblock = numBlks;
+ }
+
/* Finally, set the TID range in sscan */
ItemPointerCopy(&lowestItem, &sscan->st.tidrange.rs_mintid);
ItemPointerCopy(&highestItem, &sscan->st.tidrange.rs_maxtid);
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index a56c5eceb1..5a76cec81e 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -188,6 +188,34 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
pscan, flags);
}
+TableScanDesc
+table_beginscan_parallel_tidrange(Relation relation, ParallelTableScanDesc pscan)
+{
+ Snapshot snapshot;
+ uint32 flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
+ TableScanDesc sscan;
+
+ Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
+
+ if (!pscan->phs_snapshot_any)
+ {
+ /* Snapshot was serialized -- restore it */
+ snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
+ RegisterSnapshot(snapshot);
+ flags |= SO_TEMP_SNAPSHOT;
+ }
+ else
+ {
+ /* SnapshotAny passed by caller (not serialized) */
+ snapshot = SnapshotAny;
+ }
+
+ sscan = relation->rd_tableam->scan_begin(relation, snapshot, 0, NULL,
+ pscan, flags);
+
+ return sscan;
+}
+
/* ----------------------------------------------------------------------------
* Index scan related functions.
@@ -398,6 +426,7 @@ table_block_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan)
bpscan->phs_nblocks > NBuffers / 4;
SpinLockInit(&bpscan->phs_mutex);
bpscan->phs_startblock = InvalidBlockNumber;
+ bpscan->phs_numblock = InvalidBlockNumber;
pg_atomic_init_u64(&bpscan->phs_nallocated, 0);
return sizeof(ParallelBlockTableScanDescData);
@@ -577,8 +606,22 @@ table_block_parallelscan_nextpage(Relation rel,
pbscanwork->phsw_chunk_remaining = pbscanwork->phsw_chunk_size - 1;
}
+ /*
+ * In a parallel TID range scan, 'pbscan->phs_numblock' holds the number
+ * of blocks to scan when an upper TID range limit is specified, or
+ * InvalidBlockNumber if no limit is given. This value may be less than
+ * or equal to 'pbscan->phs_nblocks', the total number of blocks in the relation.
+ *
+ * The scan can terminate early once 'nallocated' reaches
+ * 'pbscan->phs_numblock', even if the full relation has remaining blocks
+ * to scan. This ensures that parallel workers only scan the subset of
+ * blocks that fall within the TID range.
+ */
if (nallocated >= pbscan->phs_nblocks)
- page = InvalidBlockNumber; /* all blocks have been allocated */
+ page = InvalidBlockNumber; /* all blocks have been allocated */
+ else if (pbscan->phs_numblock != InvalidBlockNumber &&
+ nallocated >= pbscan->phs_numblock)
+ page = InvalidBlockNumber; /* upper scan limit reached */
else
page = (nallocated + pbscan->phs_startblock) % pbscan->phs_nblocks;
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index fc76f22fb8..3255d92cff 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -41,6 +41,7 @@
#include "executor/nodeSort.h"
#include "executor/nodeSubplan.h"
#include "executor/tqueue.h"
+#include "executor/nodeTidrangescan.h"
#include "jit/jit.h"
#include "nodes/nodeFuncs.h"
#include "pgstat.h"
@@ -266,6 +267,11 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
ExecForeignScanEstimate((ForeignScanState *) planstate,
e->pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanEstimate((TidRangeScanState *) planstate,
+ e->pcxt);
+ break;
case T_AppendState:
if (planstate->plan->parallel_aware)
ExecAppendEstimate((AppendState *) planstate,
@@ -493,6 +499,11 @@ ExecParallelInitializeDSM(PlanState *planstate,
ExecForeignScanInitializeDSM((ForeignScanState *) planstate,
d->pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanInitializeDSM((TidRangeScanState *) planstate,
+ d->pcxt);
+ break;
case T_AppendState:
if (planstate->plan->parallel_aware)
ExecAppendInitializeDSM((AppendState *) planstate,
@@ -994,6 +1005,11 @@ ExecParallelReInitializeDSM(PlanState *planstate,
ExecForeignScanReInitializeDSM((ForeignScanState *) planstate,
pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanReInitializeDSM((TidRangeScanState *) planstate,
+ pcxt);
+ break;
case T_AppendState:
if (planstate->plan->parallel_aware)
ExecAppendReInitializeDSM((AppendState *) planstate, pcxt);
@@ -1020,7 +1036,6 @@ ExecParallelReInitializeDSM(PlanState *planstate,
case T_MemoizeState:
/* these nodes have DSM state, but no reinitialization is required */
break;
-
default:
break;
}
@@ -1362,6 +1377,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
ExecForeignScanInitializeWorker((ForeignScanState *) planstate,
pwcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanInitializeWorker((TidRangeScanState *) planstate,
+ pwcxt);
+ break;
case T_AppendState:
if (planstate->plan->parallel_aware)
ExecAppendInitializeWorker((AppendState *) planstate, pwcxt);
diff --git a/src/backend/executor/nodeTidrangescan.c b/src/backend/executor/nodeTidrangescan.c
index 26f7420b64..06a1037d51 100644
--- a/src/backend/executor/nodeTidrangescan.c
+++ b/src/backend/executor/nodeTidrangescan.c
@@ -405,3 +405,84 @@ ExecInitTidRangeScan(TidRangeScan *node, EState *estate, int eflags)
*/
return tidrangestate;
}
+/* ----------------------------------------------------------------
+ * Parallel Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanEstimate
+ *
+ * Compute the amount of space we'll need in the parallel
+ * query DSM, and inform pcxt->estimator about our needs.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanEstimate(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+
+ node->trss_pscanlen = table_parallelscan_estimate(node->ss.ss_currentRelation,
+ estate->es_snapshot);
+ shm_toc_estimate_chunk(&pcxt->estimator, node->trss_pscanlen);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanInitializeDSM
+ *
+ * Set up a parallel TID scan descriptor.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeDSM(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+ ParallelTableScanDesc pscan;
+
+ pscan = shm_toc_allocate(pcxt->toc, node->trss_pscanlen);
+ table_parallelscan_initialize(node->ss.ss_currentRelation,
+ pscan,
+ estate->es_snapshot);
+ /* disable syncscan in parallel tid range scan. */
+ pscan->phs_syncscan = false;
+ shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan);
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanReInitializeDSM
+ *
+ * Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanReInitializeDSM(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ ParallelTableScanDesc pscan;
+
+ pscan = node->ss.ss_currentScanDesc->rs_parallel;
+ table_parallelscan_reinitialize(node->ss.ss_currentRelation, pscan);
+
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanInitializeWorker
+ *
+ * Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeWorker(TidRangeScanState *node,
+ ParallelWorkerContext *pwcxt)
+{
+ ParallelTableScanDesc pscan;
+
+ pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan);
+}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 1f04a2c182..9f0215db21 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1367,7 +1367,8 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
Selectivity selectivity;
double pages;
Cost startup_cost = 0;
- Cost run_cost = 0;
+ Cost cpu_run_cost = 0;
+ Cost disk_run_cost = 0;
QualCost qpqual_cost;
Cost cpu_per_tuple;
QualCost tid_qual_cost;
@@ -1396,11 +1397,7 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
/*
* The first page in a range requires a random seek, but each subsequent
- * page is just a normal sequential page read. NOTE: it's desirable for
- * TID Range Scans to cost more than the equivalent Sequential Scans,
- * because Seq Scans have some performance advantages such as scan
- * synchronization and parallelizability, and we'd prefer one of them to
- * be picked unless a TID Range Scan really is better.
+ * page is just a normal sequential page read.
*/
ntuples = selectivity * baserel->tuples;
nseqpages = pages - 1.0;
@@ -1417,7 +1414,7 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
&spc_seq_page_cost);
/* disk costs; 1 random page and the remainder as seq pages */
- run_cost += spc_random_page_cost + spc_seq_page_cost * nseqpages;
+ disk_run_cost += spc_random_page_cost + spc_seq_page_cost * nseqpages;
/* Add scanning CPU costs */
get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
@@ -1425,24 +1422,39 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
/*
* XXX currently we assume TID quals are a subset of qpquals at this
* point; they will be removed (if possible) when we create the plan, so
- * we subtract their cost from the total qpqual cost. (If the TID quals
+ * we subtract their cost from the total qpqual cost. (If the TID quals
* can't be removed, this is a mistake and we're going to underestimate
* the CPU cost a bit.)
*/
startup_cost += qpqual_cost.startup + tid_qual_cost.per_tuple;
cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple -
tid_qual_cost.per_tuple;
- run_cost += cpu_per_tuple * ntuples;
+ cpu_run_cost += cpu_per_tuple * ntuples;
/* tlist eval costs are paid per output row, not per tuple scanned */
startup_cost += path->pathtarget->cost.startup;
- run_cost += path->pathtarget->cost.per_tuple * path->rows;
+ cpu_run_cost += path->pathtarget->cost.per_tuple * path->rows;
+
+ /* Adjust costing for parallelism, if used. */
+ if (path->parallel_workers > 0)
+ {
+ double parallel_divisor = get_parallel_divisor(path);
+
+ /* The CPU cost is divided among all the workers. */
+ cpu_run_cost /= parallel_divisor;
+
+ /*
+ * In the case of a parallel plan, the row count needs to represent
+ * the number of tuples processed per worker.
+ */
+ path->rows = clamp_row_est(path->rows / parallel_divisor);
+ }
/* we should not generate this path type when enable_tidscan=false */
Assert(enable_tidscan);
path->disabled_nodes = 0;
path->startup_cost = startup_cost;
- path->total_cost = startup_cost + run_cost;
+ path->total_cost = startup_cost + cpu_run_cost + disk_run_cost;
}
/*
diff --git a/src/backend/optimizer/path/tidpath.c b/src/backend/optimizer/path/tidpath.c
index 2bfb338b81..9c78eedcf5 100644
--- a/src/backend/optimizer/path/tidpath.c
+++ b/src/backend/optimizer/path/tidpath.c
@@ -47,6 +47,7 @@
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/restrictinfo.h"
+#include "optimizer/cost.h"
/*
@@ -553,7 +554,22 @@ create_tidscan_paths(PlannerInfo *root, RelOptInfo *rel)
add_path(rel, (Path *) create_tidrangescan_path(root, rel,
tidrangequals,
- required_outer));
+ required_outer,
+ 0));
+
+ /* If appropriate, consider parallel tid range scan. */
+ if (rel->consider_parallel && required_outer == NULL)
+ {
+ int parallel_workers;
+
+ parallel_workers = compute_parallel_worker(rel, rel->pages, -1,
+ max_parallel_workers_per_gather);
+ if (parallel_workers > 0)
+ {
+ add_partial_path(rel, (Path *) create_tidrangescan_path(root, rel, tidrangequals,
+ required_outer, parallel_workers));
+ }
+ }
}
/*
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 9cc602788e..3ad70ac958 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1262,7 +1262,8 @@ create_tidscan_path(PlannerInfo *root, RelOptInfo *rel, List *tidquals,
*/
TidRangePath *
create_tidrangescan_path(PlannerInfo *root, RelOptInfo *rel,
- List *tidrangequals, Relids required_outer)
+ List *tidrangequals, Relids required_outer,
+ int parallel_workers)
{
TidRangePath *pathnode = makeNode(TidRangePath);
@@ -1271,9 +1272,9 @@ create_tidrangescan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->path.pathtarget = rel->reltarget;
pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
required_outer);
- pathnode->path.parallel_aware = false;
+ pathnode->path.parallel_aware = (parallel_workers > 0);
pathnode->path.parallel_safe = rel->consider_parallel;
- pathnode->path.parallel_workers = 0;
+ pathnode->path.parallel_workers = parallel_workers;
pathnode->path.pathkeys = NIL; /* always unordered */
pathnode->tidrangequals = tidrangequals;
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b5e0fb386c..3da43557a1 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -96,6 +96,8 @@ typedef struct ParallelBlockTableScanDescData
BlockNumber phs_nblocks; /* # blocks in relation at start of scan */
slock_t phs_mutex; /* mutual exclusion for setting startblock */
BlockNumber phs_startblock; /* starting block number */
+ BlockNumber phs_numblock; /* # blocks to scan, or InvalidBlockNumber if
+ * no limit */
pg_atomic_uint64 phs_nallocated; /* number of blocks allocated to
* workers so far. */
} ParallelBlockTableScanDescData;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 1c9e802a6b..0f46a47c2e 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1125,6 +1125,16 @@ extern void table_parallelscan_initialize(Relation rel,
extern TableScanDesc table_beginscan_parallel(Relation relation,
ParallelTableScanDesc pscan);
+/*
+ * Begin a parallel tidrange scan. `pscan` needs to have been initialized with
+ * table_parallelscan_initialize(), for the same relation. The initialization
+ * does not need to have happened in this backend.
+ *
+ * Caller must hold a suitable lock on the relation.
+ */
+extern TableScanDesc table_beginscan_parallel_tidrange(Relation relation,
+ ParallelTableScanDesc pscan);
+
/*
* Restart a parallel scan. Call this in the leader process. Caller is
* responsible for making sure that all workers have finished the scan
diff --git a/src/include/executor/nodeTidrangescan.h b/src/include/executor/nodeTidrangescan.h
index a831f1202c..2b5465b3ce 100644
--- a/src/include/executor/nodeTidrangescan.h
+++ b/src/include/executor/nodeTidrangescan.h
@@ -14,6 +14,7 @@
#ifndef NODETIDRANGESCAN_H
#define NODETIDRANGESCAN_H
+#include "access/parallel.h"
#include "nodes/execnodes.h"
extern TidRangeScanState *ExecInitTidRangeScan(TidRangeScan *node,
@@ -21,4 +22,10 @@ extern TidRangeScanState *ExecInitTidRangeScan(TidRangeScan *node,
extern void ExecEndTidRangeScan(TidRangeScanState *node);
extern void ExecReScanTidRangeScan(TidRangeScanState *node);
+/* parallel scan support */
+extern void ExecTidRangeScanEstimate(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanReInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanInitializeWorker(TidRangeScanState *node, ParallelWorkerContext *pwcxt);
+
#endif /* NODETIDRANGESCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e107d6e5f8..958c78f66c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1929,6 +1929,7 @@ typedef struct TidScanState
* trss_mintid the lowest TID in the scan range
* trss_maxtid the highest TID in the scan range
* trss_inScan is a scan currently in progress?
+ * trss_pscanlen size of parallel TID range scan descriptor
* ----------------
*/
typedef struct TidRangeScanState
@@ -1938,6 +1939,7 @@ typedef struct TidRangeScanState
ItemPointerData trss_mintid;
ItemPointerData trss_maxtid;
bool trss_inScan;
+ Size trss_pscanlen;
} TidRangeScanState;
/* ----------------
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 60dcdb77e4..4b8dbc2a90 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -67,7 +67,8 @@ extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
extern TidRangePath *create_tidrangescan_path(PlannerInfo *root,
RelOptInfo *rel,
List *tidrangequals,
- Relids required_outer);
+ Relids required_outer,
+ int parallel_workers);
extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
List *subpaths, List *partial_subpaths,
List *pathkeys, Relids required_outer,
diff --git a/src/test/regress/expected/tidrangescan.out b/src/test/regress/expected/tidrangescan.out
index 721f3b94e0..32cd2bd9f4 100644
--- a/src/test/regress/expected/tidrangescan.out
+++ b/src/test/regress/expected/tidrangescan.out
@@ -297,4 +297,110 @@ FETCH LAST c;
COMMIT;
DROP TABLE tidrangescan;
+-- tests for parallel tidrangescans
+SET parallel_setup_cost=0;
+SET parallel_tuple_cost=0;
+SET min_parallel_table_scan_size=0;
+SET max_parallel_workers_per_gather=4;
+CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor=10);
+-- insert enough tuples such that each page gets 5 tuples with fillfactor = 10
+INSERT INTO parallel_tidrangescan SELECT i,repeat('x', 100) FROM generate_series(1,200) AS s(i);
+-- ensure there are 40 pages for parallel test
+SELECT min(ctid), max(ctid) FROM parallel_tidrangescan;
+ min | max
+-------+--------
+ (0,1) | (39,5)
+(1 row)
+
+-- parallel range scans with upper bound
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
+ QUERY PLAN
+--------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: (ctid < '(30,1)'::tid)
+(6 rows)
+
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
+ count
+-------
+ 150
+(1 row)
+
+-- parallel range scans with lower bound
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
+ QUERY PLAN
+--------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: (ctid > '(10,0)'::tid)
+(6 rows)
+
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
+ count
+-------
+ 150
+(1 row)
+
+-- parallel range scans with both bounds
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)';
+ QUERY PLAN
+-----------------------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: ((ctid > '(10,0)'::tid) AND (ctid < '(30,1)'::tid))
+(6 rows)
+
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)';
+ count
+-------
+ 100
+(1 row)
+
+-- parallel rescans
+EXPLAIN (COSTS OFF)
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+ QUERY PLAN
+----------------------------------------------------------------
+ Nested Loop
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Tid Range Scan on parallel_tidrangescan t
+ TID Cond: (ctid < '(1,0)'::tid)
+ -> Aggregate
+ -> Tid Range Scan on parallel_tidrangescan t2
+ TID Cond: (ctid <= t.ctid)
+(8 rows)
+
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+ ctid | c
+-------+---
+ (0,1) | 1
+ (0,2) | 2
+ (0,3) | 3
+ (0,4) | 4
+ (0,5) | 5
+(5 rows)
+
+DROP TABLE parallel_tidrangescan;
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
RESET enable_seqscan;
diff --git a/src/test/regress/sql/tidrangescan.sql b/src/test/regress/sql/tidrangescan.sql
index ac09ebb626..1d18b8a61d 100644
--- a/src/test/regress/sql/tidrangescan.sql
+++ b/src/test/regress/sql/tidrangescan.sql
@@ -98,4 +98,49 @@ COMMIT;
DROP TABLE tidrangescan;
+-- tests for parallel tidrangescans
+SET parallel_setup_cost=0;
+SET parallel_tuple_cost=0;
+SET min_parallel_table_scan_size=0;
+SET max_parallel_workers_per_gather=4;
+
+CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor=10);
+
+-- insert enough tuples such that each page gets 5 tuples with fillfactor = 10
+INSERT INTO parallel_tidrangescan SELECT i,repeat('x', 100) FROM generate_series(1,200) AS s(i);
+
+-- ensure there are 40 pages for parallel test
+SELECT min(ctid), max(ctid) FROM parallel_tidrangescan;
+
+-- parallel range scans with upper bound
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
+
+-- parallel range scans with lower bound
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
+
+-- parallel range scans with both bounds
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)';
+
+-- parallel rescans
+EXPLAIN (COSTS OFF)
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+
+DROP TABLE parallel_tidrangescan;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
RESET enable_seqscan;
--
2.17.1
On Fri, 25 Jul 2025 at 06:46, Cary Huang <cary.huang@highgo.ca> wrote:
Thank you so much for the review! I have addressed the comments in the
attached v7 patch.
I've now spent quite a bit of time going over this patch and testing
it. One issue I found was in heap_set_tidrange() where you were not
correctly setting the scan limits for the "if
(ItemPointerCompare(&highestItem, &lowestItem) < 0)" case. Through a
bit of manually overwriting the planner's choice using the debugger, I
could get the executor to read an entire table despite the tid range
being completely empty. I likely could have got this to misbehave
without the debugger if I'd used PREPAREd statements and made the
ctids parameters to that. It's just the planner didn't choose a
parallel plan with an empty TID range due to the costs being too low.
For the record, here's the unpatched output below:
# explain (analyze) select count(*) from parallel_tidrangescan where
ctid >= '(10,0)' and ctid <= '(9,10)';
Aggregate (cost=2.00..2.01 rows=1 width=8) (actual
time=18440.132..18440.174 rows=1.00 loops=1)
Buffers: shared read=40
-> Gather (cost=0.01..2.00 rows=1 width=0) (actual
time=18440.126..18440.166 rows=0.00 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared read=40
-> Parallel Tid Range Scan on parallel_tidrangescan
(cost=0.01..2.00 rows=1 width=0) (actual time=2414.495..2414.495
rows=0.00 loops=3)
TID Cond: ((ctid >= '(10,0)'::tid) AND (ctid <= '(9,10)'::tid))
Buffers: shared read=40
Note the "Buffers: shared read=40", which is this entire table. After
moving the code which sets the ParallelBlockTableScanDesc's limits
into heap_setscanlimits(), I get:
# explain (analyze) select count(*) from parallel_tidrangescan where ctid >= '(10,0)' and ctid <= '(9,10)';
 Aggregate  (cost=2.00..2.01 rows=1 width=8) (actual time=17.787..19.531 rows=1.00 loops=1)
   ->  Gather  (cost=0.01..2.00 rows=1 width=0) (actual time=17.783..19.527 rows=0.00 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Parallel Tid Range Scan on parallel_tidrangescan  (cost=0.01..2.00 rows=1 width=0) (actual time=0.003..0.004 rows=0.00 loops=3)
               TID Cond: ((ctid >= '(10,0)'::tid) AND (ctid <= '(9,10)'::tid))
I'm now trying to convince myself that it's safe to adjust the
ParallelBlockTableScanDesc fields in heap_setscanlimits(). These
fields are being adjusted during the call to TidRangeNext() via
table_rescan_tidrange(), which is *during* execution, so there could
be any number of parallel workers doing this concurrently. I'm unsure
at this stage if all those workers want to be using the same scan
limits, either.
Currently, I think the above is a problem, and fixing it doesn't quite
feel like committer duty. There's a chance I may get more time, but if
I don't, I've attached your v7 patch plus the adjustments I've made to
it so far.
David
Attachments:
v8-0001-v7-parallel-TID-range-scan-patch.patch
From 90c4551d8278715f934aef9bb6af0bf3e84f3dbd Mon Sep 17 00:00:00 2001
From: Cary Huang <cary.huang@highgo.ca>
Date: Thu, 24 Jul 2025 11:19:14 -0700
Subject: [PATCH v8 1/2] v7 parallel TID range scan patch
---
src/backend/access/heap/heapam.c | 13 +++
src/backend/access/table/tableam.c | 45 ++++++++-
src/backend/executor/execParallel.c | 22 ++++-
src/backend/executor/nodeTidrangescan.c | 81 ++++++++++++++++
src/backend/optimizer/path/costsize.c | 34 ++++---
src/backend/optimizer/path/tidpath.c | 18 +++-
src/backend/optimizer/util/pathnode.c | 7 +-
src/include/access/relscan.h | 2 +
src/include/access/tableam.h | 10 ++
src/include/executor/nodeTidrangescan.h | 7 ++
src/include/nodes/execnodes.h | 2 +
src/include/optimizer/pathnode.h | 3 +-
src/test/regress/expected/tidrangescan.out | 106 +++++++++++++++++++++
src/test/regress/sql/tidrangescan.sql | 45 +++++++++
14 files changed, 377 insertions(+), 18 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0dcd6ee817e..5105a2c8ad3 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1478,6 +1478,19 @@ heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
/* Set the start block and number of blocks to scan */
heap_setscanlimits(sscan, startBlk, numBlks);
+ /*
+ * If parallel mode is used, store startBlk and numBlks in parallel
+ * scan descriptor as well.
+ */
+ if (scan->rs_base.rs_parallel != NULL)
+ {
+ ParallelBlockTableScanDesc bpscan = NULL;
+
+ bpscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
+ bpscan->phs_startblock = startBlk;
+ bpscan->phs_numblock = numBlks;
+ }
+
/* Finally, set the TID range in sscan */
ItemPointerCopy(&lowestItem, &sscan->st.tidrange.rs_mintid);
ItemPointerCopy(&highestItem, &sscan->st.tidrange.rs_maxtid);
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index a56c5eceb14..5a76cec81e9 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -188,6 +188,34 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
pscan, flags);
}
+TableScanDesc
+table_beginscan_parallel_tidrange(Relation relation, ParallelTableScanDesc pscan)
+{
+ Snapshot snapshot;
+ uint32 flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
+ TableScanDesc sscan;
+
+ Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
+
+ if (!pscan->phs_snapshot_any)
+ {
+ /* Snapshot was serialized -- restore it */
+ snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
+ RegisterSnapshot(snapshot);
+ flags |= SO_TEMP_SNAPSHOT;
+ }
+ else
+ {
+ /* SnapshotAny passed by caller (not serialized) */
+ snapshot = SnapshotAny;
+ }
+
+ sscan = relation->rd_tableam->scan_begin(relation, snapshot, 0, NULL,
+ pscan, flags);
+
+ return sscan;
+}
+
/* ----------------------------------------------------------------------------
* Index scan related functions.
@@ -398,6 +426,7 @@ table_block_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan)
bpscan->phs_nblocks > NBuffers / 4;
SpinLockInit(&bpscan->phs_mutex);
bpscan->phs_startblock = InvalidBlockNumber;
+ bpscan->phs_numblock = InvalidBlockNumber;
pg_atomic_init_u64(&bpscan->phs_nallocated, 0);
return sizeof(ParallelBlockTableScanDescData);
@@ -577,8 +606,22 @@ table_block_parallelscan_nextpage(Relation rel,
pbscanwork->phsw_chunk_remaining = pbscanwork->phsw_chunk_size - 1;
}
+ /*
+ * In a parallel TID range scan, 'pbscan->phs_numblock' is non-zero if an
+ * upper TID range limit is specified, or InvalidBlockNumber if no limit
+ * is given. This value may be less than or equal to 'pbscan->phs_nblocks'
+ * , which is the total number of blocks in the relation.
+ *
+ * The scan can terminate early once 'nallocated' reaches
+ * 'pbscan->phs_numblock', even if the full relation has remaining blocks
+ * to scan. This ensures that parallel workers only scan the subset of
+ * blocks that fall within the TID range.
+ */
if (nallocated >= pbscan->phs_nblocks)
- page = InvalidBlockNumber; /* all blocks have been allocated */
+ page = InvalidBlockNumber; /* all blocks have been allocated */
+ else if (pbscan->phs_numblock != InvalidBlockNumber &&
+ nallocated >= pbscan->phs_numblock)
+ page = InvalidBlockNumber; /* upper scan limit reached */
else
page = (nallocated + pbscan->phs_startblock) % pbscan->phs_nblocks;
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index fc76f22fb82..3255d92cffd 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -41,6 +41,7 @@
#include "executor/nodeSort.h"
#include "executor/nodeSubplan.h"
#include "executor/tqueue.h"
+#include "executor/nodeTidrangescan.h"
#include "jit/jit.h"
#include "nodes/nodeFuncs.h"
#include "pgstat.h"
@@ -266,6 +267,11 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
ExecForeignScanEstimate((ForeignScanState *) planstate,
e->pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanEstimate((TidRangeScanState *) planstate,
+ e->pcxt);
+ break;
case T_AppendState:
if (planstate->plan->parallel_aware)
ExecAppendEstimate((AppendState *) planstate,
@@ -493,6 +499,11 @@ ExecParallelInitializeDSM(PlanState *planstate,
ExecForeignScanInitializeDSM((ForeignScanState *) planstate,
d->pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanInitializeDSM((TidRangeScanState *) planstate,
+ d->pcxt);
+ break;
case T_AppendState:
if (planstate->plan->parallel_aware)
ExecAppendInitializeDSM((AppendState *) planstate,
@@ -994,6 +1005,11 @@ ExecParallelReInitializeDSM(PlanState *planstate,
ExecForeignScanReInitializeDSM((ForeignScanState *) planstate,
pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanReInitializeDSM((TidRangeScanState *) planstate,
+ pcxt);
+ break;
case T_AppendState:
if (planstate->plan->parallel_aware)
ExecAppendReInitializeDSM((AppendState *) planstate, pcxt);
@@ -1020,7 +1036,6 @@ ExecParallelReInitializeDSM(PlanState *planstate,
case T_MemoizeState:
/* these nodes have DSM state, but no reinitialization is required */
break;
-
default:
break;
}
@@ -1362,6 +1377,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
ExecForeignScanInitializeWorker((ForeignScanState *) planstate,
pwcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanInitializeWorker((TidRangeScanState *) planstate,
+ pwcxt);
+ break;
case T_AppendState:
if (planstate->plan->parallel_aware)
ExecAppendInitializeWorker((AppendState *) planstate, pwcxt);
diff --git a/src/backend/executor/nodeTidrangescan.c b/src/backend/executor/nodeTidrangescan.c
index 26f7420b64b..06a1037d51e 100644
--- a/src/backend/executor/nodeTidrangescan.c
+++ b/src/backend/executor/nodeTidrangescan.c
@@ -405,3 +405,84 @@ ExecInitTidRangeScan(TidRangeScan *node, EState *estate, int eflags)
*/
return tidrangestate;
}
+/* ----------------------------------------------------------------
+ * Parallel Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanEstimate
+ *
+ * Compute the amount of space we'll need in the parallel
+ * query DSM, and inform pcxt->estimator about our needs.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanEstimate(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+
+ node->trss_pscanlen = table_parallelscan_estimate(node->ss.ss_currentRelation,
+ estate->es_snapshot);
+ shm_toc_estimate_chunk(&pcxt->estimator, node->trss_pscanlen);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanInitializeDSM
+ *
+ * Set up a parallel TID scan descriptor.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeDSM(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+ ParallelTableScanDesc pscan;
+
+ pscan = shm_toc_allocate(pcxt->toc, node->trss_pscanlen);
+ table_parallelscan_initialize(node->ss.ss_currentRelation,
+ pscan,
+ estate->es_snapshot);
+ /* disable syncscan in parallel tid range scan. */
+ pscan->phs_syncscan = false;
+ shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan);
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanReInitializeDSM
+ *
+ * Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanReInitializeDSM(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ ParallelTableScanDesc pscan;
+
+ pscan = node->ss.ss_currentScanDesc->rs_parallel;
+ table_parallelscan_reinitialize(node->ss.ss_currentRelation, pscan);
+
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanInitializeWorker
+ *
+ * Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeWorker(TidRangeScanState *node,
+ ParallelWorkerContext *pwcxt)
+{
+ ParallelTableScanDesc pscan;
+
+ pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan);
+}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 344a3188317..fdb58d094f2 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1367,7 +1367,8 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
Selectivity selectivity;
double pages;
Cost startup_cost = 0;
- Cost run_cost = 0;
+ Cost cpu_run_cost = 0;
+ Cost disk_run_cost = 0;
QualCost qpqual_cost;
Cost cpu_per_tuple;
QualCost tid_qual_cost;
@@ -1396,11 +1397,7 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
/*
* The first page in a range requires a random seek, but each subsequent
- * page is just a normal sequential page read. NOTE: it's desirable for
- * TID Range Scans to cost more than the equivalent Sequential Scans,
- * because Seq Scans have some performance advantages such as scan
- * synchronization and parallelizability, and we'd prefer one of them to
- * be picked unless a TID Range Scan really is better.
+ * page is just a normal sequential page read.
*/
ntuples = selectivity * baserel->tuples;
nseqpages = pages - 1.0;
@@ -1417,7 +1414,7 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
&spc_seq_page_cost);
/* disk costs; 1 random page and the remainder as seq pages */
- run_cost += spc_random_page_cost + spc_seq_page_cost * nseqpages;
+ disk_run_cost += spc_random_page_cost + spc_seq_page_cost * nseqpages;
/* Add scanning CPU costs */
get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
@@ -1425,24 +1422,39 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
/*
* XXX currently we assume TID quals are a subset of qpquals at this
* point; they will be removed (if possible) when we create the plan, so
- * we subtract their cost from the total qpqual cost. (If the TID quals
+ * we subtract their cost from the total qpqual cost. (If the TID quals
* can't be removed, this is a mistake and we're going to underestimate
* the CPU cost a bit.)
*/
startup_cost += qpqual_cost.startup + tid_qual_cost.per_tuple;
cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple -
tid_qual_cost.per_tuple;
- run_cost += cpu_per_tuple * ntuples;
+ cpu_run_cost += cpu_per_tuple * ntuples;
/* tlist eval costs are paid per output row, not per tuple scanned */
startup_cost += path->pathtarget->cost.startup;
- run_cost += path->pathtarget->cost.per_tuple * path->rows;
+ cpu_run_cost += path->pathtarget->cost.per_tuple * path->rows;
+
+ /* Adjust costing for parallelism, if used. */
+ if (path->parallel_workers > 0)
+ {
+ double parallel_divisor = get_parallel_divisor(path);
+
+ /* The CPU cost is divided among all the workers. */
+ cpu_run_cost /= parallel_divisor;
+
+ /*
+ * In the case of a parallel plan, the row count needs to represent
+ * the number of tuples processed per worker.
+ */
+ path->rows = clamp_row_est(path->rows / parallel_divisor);
+ }
/* we should not generate this path type when enable_tidscan=false */
Assert(enable_tidscan);
path->disabled_nodes = 0;
path->startup_cost = startup_cost;
- path->total_cost = startup_cost + run_cost;
+ path->total_cost = startup_cost + cpu_run_cost + disk_run_cost;
}
/*
diff --git a/src/backend/optimizer/path/tidpath.c b/src/backend/optimizer/path/tidpath.c
index 2bfb338b81c..9c78eedcf51 100644
--- a/src/backend/optimizer/path/tidpath.c
+++ b/src/backend/optimizer/path/tidpath.c
@@ -47,6 +47,7 @@
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/restrictinfo.h"
+#include "optimizer/cost.h"
/*
@@ -553,7 +554,22 @@ create_tidscan_paths(PlannerInfo *root, RelOptInfo *rel)
add_path(rel, (Path *) create_tidrangescan_path(root, rel,
tidrangequals,
- required_outer));
+ required_outer,
+ 0));
+
+ /* If appropriate, consider parallel tid range scan. */
+ if (rel->consider_parallel && required_outer == NULL)
+ {
+ int parallel_workers;
+
+ parallel_workers = compute_parallel_worker(rel, rel->pages, -1,
+ max_parallel_workers_per_gather);
+ if (parallel_workers > 0)
+ {
+ add_partial_path(rel, (Path *) create_tidrangescan_path(root, rel, tidrangequals,
+ required_outer, parallel_workers));
+ }
+ }
}
/*
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index a4c5867cdcb..ebfcc42551a 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1262,7 +1262,8 @@ create_tidscan_path(PlannerInfo *root, RelOptInfo *rel, List *tidquals,
*/
TidRangePath *
create_tidrangescan_path(PlannerInfo *root, RelOptInfo *rel,
- List *tidrangequals, Relids required_outer)
+ List *tidrangequals, Relids required_outer,
+ int parallel_workers)
{
TidRangePath *pathnode = makeNode(TidRangePath);
@@ -1271,9 +1272,9 @@ create_tidrangescan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->path.pathtarget = rel->reltarget;
pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
required_outer);
- pathnode->path.parallel_aware = false;
+ pathnode->path.parallel_aware = (parallel_workers > 0);
pathnode->path.parallel_safe = rel->consider_parallel;
- pathnode->path.parallel_workers = 0;
+ pathnode->path.parallel_workers = parallel_workers;
pathnode->path.pathkeys = NIL; /* always unordered */
pathnode->tidrangequals = tidrangequals;
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b5e0fb386c0..3da43557a13 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -96,6 +96,8 @@ typedef struct ParallelBlockTableScanDescData
BlockNumber phs_nblocks; /* # blocks in relation at start of scan */
slock_t phs_mutex; /* mutual exclusion for setting startblock */
BlockNumber phs_startblock; /* starting block number */
+ BlockNumber phs_numblock; /* # blocks to scan, or InvalidBlockNumber if
+ * no limit */
pg_atomic_uint64 phs_nallocated; /* number of blocks allocated to
* workers so far. */
} ParallelBlockTableScanDescData;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 1c9e802a6b1..0f46a47c2e2 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1125,6 +1125,16 @@ extern void table_parallelscan_initialize(Relation rel,
extern TableScanDesc table_beginscan_parallel(Relation relation,
ParallelTableScanDesc pscan);
+/*
+ * Begin a parallel tidrange scan. `pscan` needs to have been initialized with
+ * table_parallelscan_initialize(), for the same relation. The initialization
+ * does not need to have happened in this backend.
+ *
+ * Caller must hold a suitable lock on the relation.
+ */
+extern TableScanDesc table_beginscan_parallel_tidrange(Relation relation,
+ ParallelTableScanDesc pscan);
+
/*
* Restart a parallel scan. Call this in the leader process. Caller is
* responsible for making sure that all workers have finished the scan
diff --git a/src/include/executor/nodeTidrangescan.h b/src/include/executor/nodeTidrangescan.h
index a831f1202ca..2b5465b3ce4 100644
--- a/src/include/executor/nodeTidrangescan.h
+++ b/src/include/executor/nodeTidrangescan.h
@@ -14,6 +14,7 @@
#ifndef NODETIDRANGESCAN_H
#define NODETIDRANGESCAN_H
+#include "access/parallel.h"
#include "nodes/execnodes.h"
extern TidRangeScanState *ExecInitTidRangeScan(TidRangeScan *node,
@@ -21,4 +22,10 @@ extern TidRangeScanState *ExecInitTidRangeScan(TidRangeScan *node,
extern void ExecEndTidRangeScan(TidRangeScanState *node);
extern void ExecReScanTidRangeScan(TidRangeScanState *node);
+/* parallel scan support */
+extern void ExecTidRangeScanEstimate(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanReInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanInitializeWorker(TidRangeScanState *node, ParallelWorkerContext *pwcxt);
+
#endif /* NODETIDRANGESCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e107d6e5f81..958c78f66c9 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1929,6 +1929,7 @@ typedef struct TidScanState
* trss_mintid the lowest TID in the scan range
* trss_maxtid the highest TID in the scan range
* trss_inScan is a scan currently in progress?
+ * trss_pscanlen size of parallel TID range scan descriptor
* ----------------
*/
typedef struct TidRangeScanState
@@ -1938,6 +1939,7 @@ typedef struct TidRangeScanState
ItemPointerData trss_mintid;
ItemPointerData trss_maxtid;
bool trss_inScan;
+ Size trss_pscanlen;
} TidRangeScanState;
/* ----------------
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 58936e963cb..cbfb98454c0 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -67,7 +67,8 @@ extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
extern TidRangePath *create_tidrangescan_path(PlannerInfo *root,
RelOptInfo *rel,
List *tidrangequals,
- Relids required_outer);
+ Relids required_outer,
+ int parallel_workers);
extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
List *subpaths, List *partial_subpaths,
List *pathkeys, Relids required_outer,
diff --git a/src/test/regress/expected/tidrangescan.out b/src/test/regress/expected/tidrangescan.out
index 721f3b94e04..32cd2bd9f49 100644
--- a/src/test/regress/expected/tidrangescan.out
+++ b/src/test/regress/expected/tidrangescan.out
@@ -297,4 +297,110 @@ FETCH LAST c;
COMMIT;
DROP TABLE tidrangescan;
+-- tests for parallel tidrangescans
+SET parallel_setup_cost=0;
+SET parallel_tuple_cost=0;
+SET min_parallel_table_scan_size=0;
+SET max_parallel_workers_per_gather=4;
+CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor=10);
+-- insert enough tuples such that each page gets 5 tuples with fillfactor = 10
+INSERT INTO parallel_tidrangescan SELECT i,repeat('x', 100) FROM generate_series(1,200) AS s(i);
+-- ensure there are 40 pages for parallel test
+SELECT min(ctid), max(ctid) FROM parallel_tidrangescan;
+ min | max
+-------+--------
+ (0,1) | (39,5)
+(1 row)
+
+-- parallel range scans with upper bound
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
+ QUERY PLAN
+--------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: (ctid < '(30,1)'::tid)
+(6 rows)
+
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
+ count
+-------
+ 150
+(1 row)
+
+-- parallel range scans with lower bound
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
+ QUERY PLAN
+--------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: (ctid > '(10,0)'::tid)
+(6 rows)
+
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
+ count
+-------
+ 150
+(1 row)
+
+-- parallel range scans with both bounds
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)';
+ QUERY PLAN
+-----------------------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: ((ctid > '(10,0)'::tid) AND (ctid < '(30,1)'::tid))
+(6 rows)
+
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)';
+ count
+-------
+ 100
+(1 row)
+
+-- parallel rescans
+EXPLAIN (COSTS OFF)
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+ QUERY PLAN
+----------------------------------------------------------------
+ Nested Loop
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Tid Range Scan on parallel_tidrangescan t
+ TID Cond: (ctid < '(1,0)'::tid)
+ -> Aggregate
+ -> Tid Range Scan on parallel_tidrangescan t2
+ TID Cond: (ctid <= t.ctid)
+(8 rows)
+
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+ ctid | c
+-------+---
+ (0,1) | 1
+ (0,2) | 2
+ (0,3) | 3
+ (0,4) | 4
+ (0,5) | 5
+(5 rows)
+
+DROP TABLE parallel_tidrangescan;
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
RESET enable_seqscan;
diff --git a/src/test/regress/sql/tidrangescan.sql b/src/test/regress/sql/tidrangescan.sql
index ac09ebb6262..1d18b8a61dc 100644
--- a/src/test/regress/sql/tidrangescan.sql
+++ b/src/test/regress/sql/tidrangescan.sql
@@ -98,4 +98,49 @@ COMMIT;
DROP TABLE tidrangescan;
+-- tests for parallel tidrangescans
+SET parallel_setup_cost=0;
+SET parallel_tuple_cost=0;
+SET min_parallel_table_scan_size=0;
+SET max_parallel_workers_per_gather=4;
+
+CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor=10);
+
+-- insert enough tuples such that each page gets 5 tuples with fillfactor = 10
+INSERT INTO parallel_tidrangescan SELECT i,repeat('x', 100) FROM generate_series(1,200) AS s(i);
+
+-- ensure there are 40 pages for parallel test
+SELECT min(ctid), max(ctid) FROM parallel_tidrangescan;
+
+-- parallel range scans with upper bound
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
+
+-- parallel range scans with lower bound
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
+
+-- parallel range scans with both bounds
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)';
+
+-- parallel rescans
+EXPLAIN (COSTS OFF)
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+
+DROP TABLE parallel_tidrangescan;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
RESET enable_seqscan;
--
2.43.0
v8-0002-fixup-v7-parallel-TID-range-scan-patch.patch
From 4d829073090d882bd7d61a1b1ffcb2fea280515a Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Mon, 28 Jul 2025 17:58:58 +1200
Subject: [PATCH v8 2/2] fixup! v7 parallel TID range scan patch
---
src/backend/access/heap/heapam.c | 28 ++++++++++----------
src/backend/access/table/tableam.c | 16 +++++-------
src/backend/executor/execParallel.c | 1 +
src/backend/executor/nodeTidrangescan.c | 13 ++++------
src/backend/optimizer/path/costsize.c | 14 +++++-----
src/backend/optimizer/path/tidpath.c | 10 +++++---
src/include/nodes/execnodes.h | 2 +-
src/test/regress/expected/tidrangescan.out | 30 +++++++++++-----------
src/test/regress/sql/tidrangescan.sql | 30 +++++++++++-----------
9 files changed, 71 insertions(+), 73 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5105a2c8ad3..d0e650de573 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -490,6 +490,21 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
scan->rs_startblock = startBlk;
scan->rs_numblocks = numBlks;
+
+ /* set the limits in the ParallelBlockTableScanDesc, when present */
+
+ /*
+ * XXX no lock is being taken here. What guarantees are there that there
+ * isn't some worker using the old limits when the new limits are imposed?
+ */
+ if (scan->rs_base.rs_parallel != NULL)
+ {
+ ParallelBlockTableScanDesc bpscan;
+
+ bpscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
+ bpscan->phs_startblock = startBlk;
+ bpscan->phs_numblock = numBlks;
+ }
}
/*
@@ -1478,19 +1493,6 @@ heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
/* Set the start block and number of blocks to scan */
heap_setscanlimits(sscan, startBlk, numBlks);
- /*
- * If parallel mode is used, store startBlk and numBlks in parallel
- * scan descriptor as well.
- */
- if (scan->rs_base.rs_parallel != NULL)
- {
- ParallelBlockTableScanDesc bpscan = NULL;
-
- bpscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
- bpscan->phs_startblock = startBlk;
- bpscan->phs_numblock = numBlks;
- }
-
/* Finally, set the TID range in sscan */
ItemPointerCopy(&lowestItem, &sscan->st.tidrange.rs_mintid);
ItemPointerCopy(&highestItem, &sscan->st.tidrange.rs_maxtid);
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 5a76cec81e9..8036654c773 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -197,6 +197,9 @@ table_beginscan_parallel_tidrange(Relation relation, ParallelTableScanDesc pscan
Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
+ /* disable syncscan in parallel tid range scan. */
+ pscan->phs_syncscan = false;
+
if (!pscan->phs_snapshot_any)
{
/* Snapshot was serialized -- restore it */
@@ -607,18 +610,11 @@ table_block_parallelscan_nextpage(Relation rel,
}
/*
- * In a parallel TID range scan, 'pbscan->phs_numblock' is non-zero if an
- * upper TID range limit is specified, or InvalidBlockNumber if no limit
- * is given. This value may be less than or equal to 'pbscan->phs_nblocks'
- * , which is the total number of blocks in the relation.
- *
- * The scan can terminate early once 'nallocated' reaches
- * 'pbscan->phs_numblock', even if the full relation has remaining blocks
- * to scan. This ensures that parallel workers only scan the subset of
- * blocks that fall within the TID range.
+ * Check if we've allocated every block in the relation, or if we've
+ * reached the limit imposed by pbscan->phs_numblock (if set).
*/
if (nallocated >= pbscan->phs_nblocks)
- page = InvalidBlockNumber; /* all blocks have been allocated */
+ page = InvalidBlockNumber; /* all blocks have been allocated */
else if (pbscan->phs_numblock != InvalidBlockNumber &&
nallocated >= pbscan->phs_numblock)
page = InvalidBlockNumber; /* upper scan limit reached */
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 3255d92cffd..3f0dc5322ce 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -1036,6 +1036,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
case T_MemoizeState:
/* these nodes have DSM state, but no reinitialization is required */
break;
+
default:
break;
}
diff --git a/src/backend/executor/nodeTidrangescan.c b/src/backend/executor/nodeTidrangescan.c
index 06a1037d51e..afa47b01dec 100644
--- a/src/backend/executor/nodeTidrangescan.c
+++ b/src/backend/executor/nodeTidrangescan.c
@@ -418,13 +418,13 @@ ExecInitTidRangeScan(TidRangeScan *node, EState *estate, int eflags)
* ----------------------------------------------------------------
*/
void
-ExecTidRangeScanEstimate(TidRangeScanState *node,
- ParallelContext *pcxt)
+ExecTidRangeScanEstimate(TidRangeScanState *node, ParallelContext *pcxt)
{
EState *estate = node->ss.ps.state;
- node->trss_pscanlen = table_parallelscan_estimate(node->ss.ss_currentRelation,
- estate->es_snapshot);
+ node->trss_pscanlen =
+ table_parallelscan_estimate(node->ss.ss_currentRelation,
+ estate->es_snapshot);
shm_toc_estimate_chunk(&pcxt->estimator, node->trss_pscanlen);
shm_toc_estimate_keys(&pcxt->estimator, 1);
}
@@ -436,8 +436,7 @@ ExecTidRangeScanEstimate(TidRangeScanState *node,
* ----------------------------------------------------------------
*/
void
-ExecTidRangeScanInitializeDSM(TidRangeScanState *node,
- ParallelContext *pcxt)
+ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt)
{
EState *estate = node->ss.ps.state;
ParallelTableScanDesc pscan;
@@ -446,8 +445,6 @@ ExecTidRangeScanInitializeDSM(TidRangeScanState *node,
table_parallelscan_initialize(node->ss.ss_currentRelation,
pscan,
estate->es_snapshot);
- /* disable syncscan in parallel tid range scan. */
- pscan->phs_syncscan = false;
shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
node->ss.ss_currentScanDesc =
table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index fdb58d094f2..eab1b18d30e 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1366,9 +1366,9 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
{
Selectivity selectivity;
double pages;
- Cost startup_cost = 0;
- Cost cpu_run_cost = 0;
- Cost disk_run_cost = 0;
+ Cost startup_cost;
+ Cost cpu_run_cost;
+ Cost disk_run_cost;
QualCost qpqual_cost;
Cost cpu_per_tuple;
QualCost tid_qual_cost;
@@ -1414,7 +1414,7 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
&spc_seq_page_cost);
/* disk costs; 1 random page and the remainder as seq pages */
- disk_run_cost += spc_random_page_cost + spc_seq_page_cost * nseqpages;
+ disk_run_cost = spc_random_page_cost + spc_seq_page_cost * nseqpages;
/* Add scanning CPU costs */
get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
@@ -1422,14 +1422,14 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
/*
* XXX currently we assume TID quals are a subset of qpquals at this
* point; they will be removed (if possible) when we create the plan, so
- * we subtract their cost from the total qpqual cost. (If the TID quals
+ * we subtract their cost from the total qpqual cost. (If the TID quals
* can't be removed, this is a mistake and we're going to underestimate
* the CPU cost a bit.)
*/
- startup_cost += qpqual_cost.startup + tid_qual_cost.per_tuple;
+ startup_cost = qpqual_cost.startup + tid_qual_cost.per_tuple;
cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple -
tid_qual_cost.per_tuple;
- cpu_run_cost += cpu_per_tuple * ntuples;
+ cpu_run_cost = cpu_per_tuple * ntuples;
/* tlist eval costs are paid per output row, not per tuple scanned */
startup_cost += path->pathtarget->cost.startup;
diff --git a/src/backend/optimizer/path/tidpath.c b/src/backend/optimizer/path/tidpath.c
index 9c78eedcf51..e48c85833e7 100644
--- a/src/backend/optimizer/path/tidpath.c
+++ b/src/backend/optimizer/path/tidpath.c
@@ -564,11 +564,13 @@ create_tidscan_paths(PlannerInfo *root, RelOptInfo *rel)
parallel_workers = compute_parallel_worker(rel, rel->pages, -1,
max_parallel_workers_per_gather);
+
if (parallel_workers > 0)
- {
- add_partial_path(rel, (Path *) create_tidrangescan_path(root, rel, tidrangequals,
- required_outer, parallel_workers));
- }
+ add_partial_path(rel, (Path *) create_tidrangescan_path(root,
+ rel,
+ tidrangequals,
+ required_outer,
+ parallel_workers));
}
}
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 958c78f66c9..4947b6cca00 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1929,7 +1929,7 @@ typedef struct TidScanState
* trss_mintid the lowest TID in the scan range
* trss_maxtid the highest TID in the scan range
* trss_inScan is a scan currently in progress?
- * trss_pscanlen size of parallel TID range scan descriptor
+ * trss_pscanlen size of parallel heap scan descriptor
* ----------------
*/
typedef struct TidRangeScanState
diff --git a/src/test/regress/expected/tidrangescan.out b/src/test/regress/expected/tidrangescan.out
index 32cd2bd9f49..3c5fc9e102a 100644
--- a/src/test/regress/expected/tidrangescan.out
+++ b/src/test/regress/expected/tidrangescan.out
@@ -298,13 +298,13 @@ FETCH LAST c;
COMMIT;
DROP TABLE tidrangescan;
-- tests for parallel tidrangescans
-SET parallel_setup_cost=0;
-SET parallel_tuple_cost=0;
-SET min_parallel_table_scan_size=0;
-SET max_parallel_workers_per_gather=4;
-CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor=10);
+SET parallel_setup_cost TO 0;
+SET parallel_tuple_cost TO 0;
+SET min_parallel_table_scan_size TO 0;
+SET max_parallel_workers_per_gather TO 4;
+CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor = 10);
-- insert enough tuples such that each page gets 5 tuples with fillfactor = 10
-INSERT INTO parallel_tidrangescan SELECT i,repeat('x', 100) FROM generate_series(1,200) AS s(i);
+INSERT INTO parallel_tidrangescan SELECT i, repeat('x', 100) FROM generate_series(1,200) AS s(i);
-- ensure there are 40 pages for parallel test
SELECT min(ctid), max(ctid) FROM parallel_tidrangescan;
min | max
@@ -313,8 +313,8 @@ SELECT min(ctid), max(ctid) FROM parallel_tidrangescan;
(1 row)
-- parallel range scans with upper bound
-EXPLAIN (costs off)
-SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
QUERY PLAN
--------------------------------------------------------------------
Finalize Aggregate
@@ -325,15 +325,15 @@ SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
TID Cond: (ctid < '(30,1)'::tid)
(6 rows)
-SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
count
-------
150
(1 row)
-- parallel range scans with lower bound
-EXPLAIN (costs off)
-SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
QUERY PLAN
--------------------------------------------------------------------
Finalize Aggregate
@@ -344,15 +344,15 @@ SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
TID Cond: (ctid > '(10,0)'::tid)
(6 rows)
-SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
count
-------
150
(1 row)
-- parallel range scans with both bounds
-EXPLAIN (costs off)
-SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)';
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
QUERY PLAN
-----------------------------------------------------------------------------------
Finalize Aggregate
@@ -363,7 +363,7 @@ SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)'
TID Cond: ((ctid > '(10,0)'::tid) AND (ctid < '(30,1)'::tid))
(6 rows)
-SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
count
-------
100
diff --git a/src/test/regress/sql/tidrangescan.sql b/src/test/regress/sql/tidrangescan.sql
index 1d18b8a61dc..0f1e43c6d05 100644
--- a/src/test/regress/sql/tidrangescan.sql
+++ b/src/test/regress/sql/tidrangescan.sql
@@ -99,33 +99,33 @@ COMMIT;
DROP TABLE tidrangescan;
-- tests for parallel tidrangescans
-SET parallel_setup_cost=0;
-SET parallel_tuple_cost=0;
-SET min_parallel_table_scan_size=0;
-SET max_parallel_workers_per_gather=4;
+SET parallel_setup_cost TO 0;
+SET parallel_tuple_cost TO 0;
+SET min_parallel_table_scan_size TO 0;
+SET max_parallel_workers_per_gather TO 4;
-CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor=10);
+CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor = 10);
-- insert enough tuples such that each page gets 5 tuples with fillfactor = 10
-INSERT INTO parallel_tidrangescan SELECT i,repeat('x', 100) FROM generate_series(1,200) AS s(i);
+INSERT INTO parallel_tidrangescan SELECT i, repeat('x', 100) FROM generate_series(1,200) AS s(i);
-- ensure there are 40 pages for parallel test
SELECT min(ctid), max(ctid) FROM parallel_tidrangescan;
-- parallel range scans with upper bound
-EXPLAIN (costs off)
-SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
-SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
-- parallel range scans with lower bound
-EXPLAIN (costs off)
-SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
-SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
-- parallel range scans with both bounds
-EXPLAIN (costs off)
-SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)';
-SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)';
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
-- parallel rescans
EXPLAIN (COSTS OFF)
--
2.43.0
Hello David
Thank you so much for testing the patch more thoroughly and for sharing your patch.
I see that you moved the setting of the ParallelBlockTableScanDesc fields from
heap_set_tidrange() into heap_setscanlimits() so that the previously missing case of
(ItemPointerCompare(&highestItem, &lowestItem) < 0) is covered. This
placement should be fine.
However, as you described, the workers still end up in heap_set_tidrange() at the
beginning of the scan, via the call to table_rescan_tidrange() in TidRangeNext().
Each worker then attempts to store what is essentially the same TID range into the
parallel scan descriptor shared among the workers in shared memory, which looks
a little racy.
The problem is that the logic in TidRangeNext() was designed mostly for the
non-parallel case, so table_rescan_tidrange() is incorrectly called in parallel mode
as well, causing every worker to try to update the TID range at the same time.
The initialization of the scan descriptor, the rescan, and the setting of the TID range
in fact happen in different places in parallel and non-parallel modes. In non-parallel
mode, everything happens in TidRangeNext(), while in parallel mode it happens in
ExecTidRangeScanInitializeDSM() and ExecTidRangeScanReInitializeDSM(), which
are called only by the leader.
I have updated TidRangeNext() so that it does not rescan or set the TID limits in
parallel mode. I have also updated ExecTidRangeScanInitializeDSM() to set the TID
range after initializing the parallel scan descriptor, which is shared with the workers
via shared memory. Similarly, in ExecTidRangeScanReInitializeDSM(), a new TID range
is set after the parallel scan descriptor is re-initialized for the rescan case.
ExecTidRangeScanInitializeWorker(), which is called by each parallel worker, is also
updated so that it does not set the TID limits again.
Since ExecTidRangeScanInitializeDSM() and ExecTidRangeScanReInitializeDSM() are
called only by the leader process, there are no concurrent calls to
heap_setscanlimits(), so no locking is needed there.
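To make the intended division of labor concrete, here is a minimal, self-contained model of how parallel workers claim blocks from the shared counter and stop early at the TID-range limit, in the style of table_block_parallelscan_nextpage() from the patch. This is an illustrative sketch only, not PostgreSQL code: the struct and field names (ParallelScanModel, nblocks, numblock, nallocated) merely mirror the patch's phs_nblocks, phs_numblock, and phs_nallocated, and the chunked allocation logic of the real function is omitted.

```c
#include <stdatomic.h>

#define InvalidBlockNumber 0xFFFFFFFFu
typedef unsigned int BlockNumber;

/*
 * Hypothetical single-file model of the parallel block allocator: the
 * leader sets nblocks/numblock once; workers then race on the atomic
 * counter, so no lock is needed on the limits themselves.
 */
typedef struct
{
	BlockNumber nblocks;		/* total blocks in the relation */
	BlockNumber numblock;		/* TID-range limit, or InvalidBlockNumber */
	atomic_ulong nallocated;	/* next block number to hand out */
} ParallelScanModel;

static BlockNumber
next_block(ParallelScanModel *scan)
{
	/* each call atomically claims one block number */
	unsigned long nallocated = atomic_fetch_add(&scan->nallocated, 1);

	if (nallocated >= scan->nblocks)
		return InvalidBlockNumber;	/* all blocks have been allocated */
	if (scan->numblock != InvalidBlockNumber &&
		nallocated >= scan->numblock)
		return InvalidBlockNumber;	/* upper scan limit reached */
	return (BlockNumber) nallocated;
}
```

With nblocks = 40 and numblock = 30, repeated calls hand out blocks 0 through 29 and then return InvalidBlockNumber, so the workers collectively scan only the subset of blocks covered by the TID range even though the relation has more blocks.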
On top of the two patches you shared, I have attached a third patch containing the
changes described above, so the new diffs are easy to spot.
Thank you once again for your review!
Cary
Attachments:
v9-0002-fixup-v7-parallel-TID-range-scan-patch.patch
From 749b78bddecac6f5d57d3ef94be89c65841fd162 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Mon, 28 Jul 2025 17:58:58 +1200
Subject: [PATCH v9 2/3] fixup! v7 parallel TID range scan patch
---
src/backend/access/heap/heapam.c | 28 ++++++++++----------
src/backend/access/table/tableam.c | 16 +++++-------
src/backend/executor/execParallel.c | 1 +
src/backend/executor/nodeTidrangescan.c | 13 ++++------
src/backend/optimizer/path/costsize.c | 14 +++++-----
src/backend/optimizer/path/tidpath.c | 10 +++++---
src/include/nodes/execnodes.h | 2 +-
src/test/regress/expected/tidrangescan.out | 30 +++++++++++-----------
src/test/regress/sql/tidrangescan.sql | 30 +++++++++++-----------
9 files changed, 71 insertions(+), 73 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5105a2c8ad..d0e650de57 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -490,6 +490,21 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
scan->rs_startblock = startBlk;
scan->rs_numblocks = numBlks;
+
+ /* set the limits in the ParallelBlockTableScanDesc, when present */
+
+ /*
+ * XXX no lock is being taken here. What guarantees are there that there
+ * isn't some worker using the old limits when the new limits are imposed?
+ */
+ if (scan->rs_base.rs_parallel != NULL)
+ {
+ ParallelBlockTableScanDesc bpscan;
+
+ bpscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
+ bpscan->phs_startblock = startBlk;
+ bpscan->phs_numblock = numBlks;
+ }
}
/*
@@ -1478,19 +1493,6 @@ heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
/* Set the start block and number of blocks to scan */
heap_setscanlimits(sscan, startBlk, numBlks);
- /*
- * If parallel mode is used, store startBlk and numBlks in parallel
- * scan descriptor as well.
- */
- if (scan->rs_base.rs_parallel != NULL)
- {
- ParallelBlockTableScanDesc bpscan = NULL;
-
- bpscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
- bpscan->phs_startblock = startBlk;
- bpscan->phs_numblock = numBlks;
- }
-
/* Finally, set the TID range in sscan */
ItemPointerCopy(&lowestItem, &sscan->st.tidrange.rs_mintid);
ItemPointerCopy(&highestItem, &sscan->st.tidrange.rs_maxtid);
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 5a76cec81e..8036654c77 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -197,6 +197,9 @@ table_beginscan_parallel_tidrange(Relation relation, ParallelTableScanDesc pscan
Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
+ /* disable syncscan in parallel tid range scan. */
+ pscan->phs_syncscan = false;
+
if (!pscan->phs_snapshot_any)
{
/* Snapshot was serialized -- restore it */
@@ -607,18 +610,11 @@ table_block_parallelscan_nextpage(Relation rel,
}
/*
- * In a parallel TID range scan, 'pbscan->phs_numblock' is non-zero if an
- * upper TID range limit is specified, or InvalidBlockNumber if no limit
- * is given. This value may be less than or equal to 'pbscan->phs_nblocks'
- * , which is the total number of blocks in the relation.
- *
- * The scan can terminate early once 'nallocated' reaches
- * 'pbscan->phs_numblock', even if the full relation has remaining blocks
- * to scan. This ensures that parallel workers only scan the subset of
- * blocks that fall within the TID range.
+ * Check if we've allocated every block in the relation, or if we've
+ * reached the limit imposed by pbscan->phs_numblock (if set).
*/
if (nallocated >= pbscan->phs_nblocks)
- page = InvalidBlockNumber; /* all blocks have been allocated */
+ page = InvalidBlockNumber; /* all blocks have been allocated */
else if (pbscan->phs_numblock != InvalidBlockNumber &&
nallocated >= pbscan->phs_numblock)
page = InvalidBlockNumber; /* upper scan limit reached */
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index a58f7eafc9..7b1eb2e82c 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -1036,6 +1036,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
case T_MemoizeState:
/* these nodes have DSM state, but no reinitialization is required */
break;
+
default:
break;
}
diff --git a/src/backend/executor/nodeTidrangescan.c b/src/backend/executor/nodeTidrangescan.c
index 06a1037d51..afa47b01de 100644
--- a/src/backend/executor/nodeTidrangescan.c
+++ b/src/backend/executor/nodeTidrangescan.c
@@ -418,13 +418,13 @@ ExecInitTidRangeScan(TidRangeScan *node, EState *estate, int eflags)
* ----------------------------------------------------------------
*/
void
-ExecTidRangeScanEstimate(TidRangeScanState *node,
- ParallelContext *pcxt)
+ExecTidRangeScanEstimate(TidRangeScanState *node, ParallelContext *pcxt)
{
EState *estate = node->ss.ps.state;
- node->trss_pscanlen = table_parallelscan_estimate(node->ss.ss_currentRelation,
- estate->es_snapshot);
+ node->trss_pscanlen =
+ table_parallelscan_estimate(node->ss.ss_currentRelation,
+ estate->es_snapshot);
shm_toc_estimate_chunk(&pcxt->estimator, node->trss_pscanlen);
shm_toc_estimate_keys(&pcxt->estimator, 1);
}
@@ -436,8 +436,7 @@ ExecTidRangeScanEstimate(TidRangeScanState *node,
* ----------------------------------------------------------------
*/
void
-ExecTidRangeScanInitializeDSM(TidRangeScanState *node,
- ParallelContext *pcxt)
+ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt)
{
EState *estate = node->ss.ps.state;
ParallelTableScanDesc pscan;
@@ -446,8 +445,6 @@ ExecTidRangeScanInitializeDSM(TidRangeScanState *node,
table_parallelscan_initialize(node->ss.ss_currentRelation,
pscan,
estate->es_snapshot);
- /* disable syncscan in parallel tid range scan. */
- pscan->phs_syncscan = false;
shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
node->ss.ss_currentScanDesc =
table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index fdb58d094f..eab1b18d30 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1366,9 +1366,9 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
{
Selectivity selectivity;
double pages;
- Cost startup_cost = 0;
- Cost cpu_run_cost = 0;
- Cost disk_run_cost = 0;
+ Cost startup_cost;
+ Cost cpu_run_cost;
+ Cost disk_run_cost;
QualCost qpqual_cost;
Cost cpu_per_tuple;
QualCost tid_qual_cost;
@@ -1414,7 +1414,7 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
&spc_seq_page_cost);
/* disk costs; 1 random page and the remainder as seq pages */
- disk_run_cost += spc_random_page_cost + spc_seq_page_cost * nseqpages;
+ disk_run_cost = spc_random_page_cost + spc_seq_page_cost * nseqpages;
/* Add scanning CPU costs */
get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
@@ -1422,14 +1422,14 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
/*
* XXX currently we assume TID quals are a subset of qpquals at this
* point; they will be removed (if possible) when we create the plan, so
- * we subtract their cost from the total qpqual cost. (If the TID quals
+ * we subtract their cost from the total qpqual cost. (If the TID quals
* can't be removed, this is a mistake and we're going to underestimate
* the CPU cost a bit.)
*/
- startup_cost += qpqual_cost.startup + tid_qual_cost.per_tuple;
+ startup_cost = qpqual_cost.startup + tid_qual_cost.per_tuple;
cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple -
tid_qual_cost.per_tuple;
- cpu_run_cost += cpu_per_tuple * ntuples;
+ cpu_run_cost = cpu_per_tuple * ntuples;
/* tlist eval costs are paid per output row, not per tuple scanned */
startup_cost += path->pathtarget->cost.startup;
diff --git a/src/backend/optimizer/path/tidpath.c b/src/backend/optimizer/path/tidpath.c
index 9c78eedcf5..e48c85833e 100644
--- a/src/backend/optimizer/path/tidpath.c
+++ b/src/backend/optimizer/path/tidpath.c
@@ -564,11 +564,13 @@ create_tidscan_paths(PlannerInfo *root, RelOptInfo *rel)
parallel_workers = compute_parallel_worker(rel, rel->pages, -1,
max_parallel_workers_per_gather);
+
if (parallel_workers > 0)
- {
- add_partial_path(rel, (Path *) create_tidrangescan_path(root, rel, tidrangequals,
- required_outer, parallel_workers));
- }
+ add_partial_path(rel, (Path *) create_tidrangescan_path(root,
+ rel,
+ tidrangequals,
+ required_outer,
+ parallel_workers));
}
}
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 958c78f66c..4947b6cca0 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1929,7 +1929,7 @@ typedef struct TidScanState
* trss_mintid the lowest TID in the scan range
* trss_maxtid the highest TID in the scan range
* trss_inScan is a scan currently in progress?
- * trss_pscanlen size of parallel TID range scan descriptor
+ * trss_pscanlen size of parallel heap scan descriptor
* ----------------
*/
typedef struct TidRangeScanState
diff --git a/src/test/regress/expected/tidrangescan.out b/src/test/regress/expected/tidrangescan.out
index 32cd2bd9f4..3c5fc9e102 100644
--- a/src/test/regress/expected/tidrangescan.out
+++ b/src/test/regress/expected/tidrangescan.out
@@ -298,13 +298,13 @@ FETCH LAST c;
COMMIT;
DROP TABLE tidrangescan;
-- tests for parallel tidrangescans
-SET parallel_setup_cost=0;
-SET parallel_tuple_cost=0;
-SET min_parallel_table_scan_size=0;
-SET max_parallel_workers_per_gather=4;
-CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor=10);
+SET parallel_setup_cost TO 0;
+SET parallel_tuple_cost TO 0;
+SET min_parallel_table_scan_size TO 0;
+SET max_parallel_workers_per_gather TO 4;
+CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor = 10);
-- insert enough tuples such that each page gets 5 tuples with fillfactor = 10
-INSERT INTO parallel_tidrangescan SELECT i,repeat('x', 100) FROM generate_series(1,200) AS s(i);
+INSERT INTO parallel_tidrangescan SELECT i, repeat('x', 100) FROM generate_series(1,200) AS s(i);
-- ensure there are 40 pages for parallel test
SELECT min(ctid), max(ctid) FROM parallel_tidrangescan;
min | max
@@ -313,8 +313,8 @@ SELECT min(ctid), max(ctid) FROM parallel_tidrangescan;
(1 row)
-- parallel range scans with upper bound
-EXPLAIN (costs off)
-SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
QUERY PLAN
--------------------------------------------------------------------
Finalize Aggregate
@@ -325,15 +325,15 @@ SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
TID Cond: (ctid < '(30,1)'::tid)
(6 rows)
-SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
count
-------
150
(1 row)
-- parallel range scans with lower bound
-EXPLAIN (costs off)
-SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
QUERY PLAN
--------------------------------------------------------------------
Finalize Aggregate
@@ -344,15 +344,15 @@ SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
TID Cond: (ctid > '(10,0)'::tid)
(6 rows)
-SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
count
-------
150
(1 row)
-- parallel range scans with both bounds
-EXPLAIN (costs off)
-SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)';
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
QUERY PLAN
-----------------------------------------------------------------------------------
Finalize Aggregate
@@ -363,7 +363,7 @@ SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)'
TID Cond: ((ctid > '(10,0)'::tid) AND (ctid < '(30,1)'::tid))
(6 rows)
-SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
count
-------
100
diff --git a/src/test/regress/sql/tidrangescan.sql b/src/test/regress/sql/tidrangescan.sql
index 1d18b8a61d..0f1e43c6d0 100644
--- a/src/test/regress/sql/tidrangescan.sql
+++ b/src/test/regress/sql/tidrangescan.sql
@@ -99,33 +99,33 @@ COMMIT;
DROP TABLE tidrangescan;
-- tests for parallel tidrangescans
-SET parallel_setup_cost=0;
-SET parallel_tuple_cost=0;
-SET min_parallel_table_scan_size=0;
-SET max_parallel_workers_per_gather=4;
+SET parallel_setup_cost TO 0;
+SET parallel_tuple_cost TO 0;
+SET min_parallel_table_scan_size TO 0;
+SET max_parallel_workers_per_gather TO 4;
-CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor=10);
+CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor = 10);
-- insert enough tuples such that each page gets 5 tuples with fillfactor = 10
-INSERT INTO parallel_tidrangescan SELECT i,repeat('x', 100) FROM generate_series(1,200) AS s(i);
+INSERT INTO parallel_tidrangescan SELECT i, repeat('x', 100) FROM generate_series(1,200) AS s(i);
-- ensure there are 40 pages for parallel test
SELECT min(ctid), max(ctid) FROM parallel_tidrangescan;
-- parallel range scans with upper bound
-EXPLAIN (costs off)
-SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
-SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
-- parallel range scans with lower bound
-EXPLAIN (costs off)
-SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
-SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
-- parallel range scans with both bounds
-EXPLAIN (costs off)
-SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)';
-SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)';
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
-- parallel rescans
EXPLAIN (COSTS OFF)
--
2.17.1
v9-0003-fix-the-incorrect-call-to-scan_set_tidrange.patch
From de650c53515ca822abb78ff0e7720638a55564ee Mon Sep 17 00:00:00 2001
From: Cary Huang <cary.huang@highgo.ca>
Date: Wed, 13 Aug 2025 14:26:56 -0700
Subject: [PATCH v9 3/3] fix the incorrect call to scan_set_tidrange()
---
src/backend/access/heap/heapam.c | 5 ----
src/backend/access/table/tableam.c | 7 ++++-
src/backend/executor/nodeTidrangescan.c | 37 +++++++++++++++++++++----
src/include/access/tableam.h | 4 ++-
4 files changed, 40 insertions(+), 13 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d0e650de57..46f5df2ec4 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -492,11 +492,6 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
scan->rs_numblocks = numBlks;
/* set the limits in the ParallelBlockTableScanDesc, when present */
-
- /*
- * XXX no lock is being taken here. What guarantees are there that there
- * isn't some worker using the old limits when the new limits are imposed?
- */
if (scan->rs_base.rs_parallel != NULL)
{
ParallelBlockTableScanDesc bpscan;
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 8036654c77..01ca264ba4 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -189,7 +189,8 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
}
TableScanDesc
-table_beginscan_parallel_tidrange(Relation relation, ParallelTableScanDesc pscan)
+table_beginscan_parallel_tidrange(Relation relation, ParallelTableScanDesc pscan,
+ ItemPointerData * mintid, ItemPointerData * maxtid)
{
Snapshot snapshot;
uint32 flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
@@ -216,6 +217,10 @@ table_beginscan_parallel_tidrange(Relation relation, ParallelTableScanDesc pscan
sscan = relation->rd_tableam->scan_begin(relation, snapshot, 0, NULL,
pscan, flags);
+ /* Set the TID range if needed */
+ if (mintid && maxtid)
+ relation->rd_tableam->scan_set_tidrange(sscan, mintid, maxtid);
+
return sscan;
}
diff --git a/src/backend/executor/nodeTidrangescan.c b/src/backend/executor/nodeTidrangescan.c
index afa47b01de..eef44e9d78 100644
--- a/src/backend/executor/nodeTidrangescan.c
+++ b/src/backend/executor/nodeTidrangescan.c
@@ -250,9 +250,13 @@ TidRangeNext(TidRangeScanState *node)
}
else
{
- /* rescan with the updated TID range */
- table_rescan_tidrange(scandesc, &node->trss_mintid,
- &node->trss_maxtid);
+ /* rescan with the updated TID range only in non-parallel mode */
+ if (scandesc->rs_parallel == NULL)
+ {
+ /* rescan with the updated TID range */
+ table_rescan_tidrange(scandesc, &node->trss_mintid,
+ &node->trss_maxtid);
+ }
}
node->trss_inScan = true;
@@ -446,8 +450,19 @@ ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt)
pscan,
estate->es_snapshot);
shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
- node->ss.ss_currentScanDesc =
- table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan);
+
+ /*
+ * Initialize parallel scan descriptor with given TID range if it can be
+ * evaluated successfully.
+ */
+ if (TidRangeEval(node))
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan,
+ &node->trss_mintid, &node->trss_maxtid);
+ else
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan,
+ NULL, NULL);
}
/* ----------------------------------------------------------------
@@ -465,6 +480,11 @@ ExecTidRangeScanReInitializeDSM(TidRangeScanState *node,
pscan = node->ss.ss_currentScanDesc->rs_parallel;
table_parallelscan_reinitialize(node->ss.ss_currentRelation, pscan);
+ /* Set the new TID range if it can be evaluated successfully */
+ if (TidRangeEval(node))
+ node->ss.ss_currentRelation->rd_tableam->scan_set_tidrange(
+ node->ss.ss_currentScanDesc, &node->trss_mintid,
+ &node->trss_maxtid);
}
/* ----------------------------------------------------------------
@@ -480,6 +500,11 @@ ExecTidRangeScanInitializeWorker(TidRangeScanState *node,
ParallelTableScanDesc pscan;
pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
+
+ /*
+ * As a worker, there is no need to set TID range as it has already been set
+ * by the leader and available in shared memory.
+ */
node->ss.ss_currentScanDesc =
- table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan);
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan, NULL, NULL);
}
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 0f46a47c2e..99596d6258 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1133,7 +1133,9 @@ extern TableScanDesc table_beginscan_parallel(Relation relation,
* Caller must hold a suitable lock on the relation.
*/
extern TableScanDesc table_beginscan_parallel_tidrange(Relation relation,
- ParallelTableScanDesc pscan);
+ ParallelTableScanDesc pscan,
+ ItemPointerData * mintid,
+ ItemPointerData * maxtid);
/*
* Restart a parallel scan. Call this in the leader process. Caller is
--
2.17.1
v9-0001-v7-parallel-TID-range-scan-patch.patch
From 32e09c7c8930c05d87a02ef5369c6bae38a38e34 Mon Sep 17 00:00:00 2001
From: Cary Huang <cary.huang@highgo.ca>
Date: Thu, 24 Jul 2025 11:19:14 -0700
Subject: [PATCH v9 1/3] v7 parallel TID range scan patch
---
src/backend/access/heap/heapam.c | 13 +++
src/backend/access/table/tableam.c | 45 ++++++++-
src/backend/executor/execParallel.c | 22 ++++-
src/backend/executor/nodeTidrangescan.c | 81 ++++++++++++++++
src/backend/optimizer/path/costsize.c | 34 ++++---
src/backend/optimizer/path/tidpath.c | 18 +++-
src/backend/optimizer/util/pathnode.c | 7 +-
src/include/access/relscan.h | 2 +
src/include/access/tableam.h | 10 ++
src/include/executor/nodeTidrangescan.h | 7 ++
src/include/nodes/execnodes.h | 2 +
src/include/optimizer/pathnode.h | 3 +-
src/test/regress/expected/tidrangescan.out | 106 +++++++++++++++++++++
src/test/regress/sql/tidrangescan.sql | 45 +++++++++
14 files changed, 377 insertions(+), 18 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0dcd6ee817..5105a2c8ad 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1478,6 +1478,19 @@ heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
/* Set the start block and number of blocks to scan */
heap_setscanlimits(sscan, startBlk, numBlks);
+ /*
+ * If parallel mode is used, store startBlk and numBlks in parallel
+ * scan descriptor as well.
+ */
+ if (scan->rs_base.rs_parallel != NULL)
+ {
+ ParallelBlockTableScanDesc bpscan = NULL;
+
+ bpscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
+ bpscan->phs_startblock = startBlk;
+ bpscan->phs_numblock = numBlks;
+ }
+
/* Finally, set the TID range in sscan */
ItemPointerCopy(&lowestItem, &sscan->st.tidrange.rs_mintid);
ItemPointerCopy(&highestItem, &sscan->st.tidrange.rs_maxtid);
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index a56c5eceb1..5a76cec81e 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -188,6 +188,34 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
pscan, flags);
}
+TableScanDesc
+table_beginscan_parallel_tidrange(Relation relation, ParallelTableScanDesc pscan)
+{
+ Snapshot snapshot;
+ uint32 flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
+ TableScanDesc sscan;
+
+ Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
+
+ if (!pscan->phs_snapshot_any)
+ {
+ /* Snapshot was serialized -- restore it */
+ snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
+ RegisterSnapshot(snapshot);
+ flags |= SO_TEMP_SNAPSHOT;
+ }
+ else
+ {
+ /* SnapshotAny passed by caller (not serialized) */
+ snapshot = SnapshotAny;
+ }
+
+ sscan = relation->rd_tableam->scan_begin(relation, snapshot, 0, NULL,
+ pscan, flags);
+
+ return sscan;
+}
+
/* ----------------------------------------------------------------------------
* Index scan related functions.
@@ -398,6 +426,7 @@ table_block_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan)
bpscan->phs_nblocks > NBuffers / 4;
SpinLockInit(&bpscan->phs_mutex);
bpscan->phs_startblock = InvalidBlockNumber;
+ bpscan->phs_numblock = InvalidBlockNumber;
pg_atomic_init_u64(&bpscan->phs_nallocated, 0);
return sizeof(ParallelBlockTableScanDescData);
@@ -577,8 +606,22 @@ table_block_parallelscan_nextpage(Relation rel,
pbscanwork->phsw_chunk_remaining = pbscanwork->phsw_chunk_size - 1;
}
+ /*
+ * In a parallel TID range scan, 'pbscan->phs_numblock' holds the number
+ * of blocks to scan when an upper TID range limit is specified, or
+ * InvalidBlockNumber when no limit is given. This value may be less than
+ * or equal to 'pbscan->phs_nblocks', the total number of blocks in the relation.
+ *
+ * The scan can terminate early once 'nallocated' reaches
+ * 'pbscan->phs_numblock', even if the full relation has remaining blocks
+ * to scan. This ensures that parallel workers only scan the subset of
+ * blocks that fall within the TID range.
+ */
if (nallocated >= pbscan->phs_nblocks)
- page = InvalidBlockNumber; /* all blocks have been allocated */
+ page = InvalidBlockNumber; /* all blocks have been allocated */
+ else if (pbscan->phs_numblock != InvalidBlockNumber &&
+ nallocated >= pbscan->phs_numblock)
+ page = InvalidBlockNumber; /* upper scan limit reached */
else
page = (nallocated + pbscan->phs_startblock) % pbscan->phs_nblocks;
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f098a5557c..a58f7eafc9 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -41,6 +41,7 @@
#include "executor/nodeSort.h"
#include "executor/nodeSubplan.h"
#include "executor/tqueue.h"
+#include "executor/nodeTidrangescan.h"
#include "jit/jit.h"
#include "nodes/nodeFuncs.h"
#include "pgstat.h"
@@ -266,6 +267,11 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
ExecForeignScanEstimate((ForeignScanState *) planstate,
e->pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanEstimate((TidRangeScanState *) planstate,
+ e->pcxt);
+ break;
case T_AppendState:
if (planstate->plan->parallel_aware)
ExecAppendEstimate((AppendState *) planstate,
@@ -493,6 +499,11 @@ ExecParallelInitializeDSM(PlanState *planstate,
ExecForeignScanInitializeDSM((ForeignScanState *) planstate,
d->pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanInitializeDSM((TidRangeScanState *) planstate,
+ d->pcxt);
+ break;
case T_AppendState:
if (planstate->plan->parallel_aware)
ExecAppendInitializeDSM((AppendState *) planstate,
@@ -994,6 +1005,11 @@ ExecParallelReInitializeDSM(PlanState *planstate,
ExecForeignScanReInitializeDSM((ForeignScanState *) planstate,
pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanReInitializeDSM((TidRangeScanState *) planstate,
+ pcxt);
+ break;
case T_AppendState:
if (planstate->plan->parallel_aware)
ExecAppendReInitializeDSM((AppendState *) planstate, pcxt);
@@ -1020,7 +1036,6 @@ ExecParallelReInitializeDSM(PlanState *planstate,
case T_MemoizeState:
/* these nodes have DSM state, but no reinitialization is required */
break;
-
default:
break;
}
@@ -1362,6 +1377,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
ExecForeignScanInitializeWorker((ForeignScanState *) planstate,
pwcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanInitializeWorker((TidRangeScanState *) planstate,
+ pwcxt);
+ break;
case T_AppendState:
if (planstate->plan->parallel_aware)
ExecAppendInitializeWorker((AppendState *) planstate, pwcxt);
diff --git a/src/backend/executor/nodeTidrangescan.c b/src/backend/executor/nodeTidrangescan.c
index 26f7420b64..06a1037d51 100644
--- a/src/backend/executor/nodeTidrangescan.c
+++ b/src/backend/executor/nodeTidrangescan.c
@@ -405,3 +405,84 @@ ExecInitTidRangeScan(TidRangeScan *node, EState *estate, int eflags)
*/
return tidrangestate;
}
+/* ----------------------------------------------------------------
+ * Parallel Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanEstimate
+ *
+ * Compute the amount of space we'll need in the parallel
+ * query DSM, and inform pcxt->estimator about our needs.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanEstimate(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+
+ node->trss_pscanlen = table_parallelscan_estimate(node->ss.ss_currentRelation,
+ estate->es_snapshot);
+ shm_toc_estimate_chunk(&pcxt->estimator, node->trss_pscanlen);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanInitializeDSM
+ *
+ * Set up a parallel TID scan descriptor.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeDSM(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+ ParallelTableScanDesc pscan;
+
+ pscan = shm_toc_allocate(pcxt->toc, node->trss_pscanlen);
+ table_parallelscan_initialize(node->ss.ss_currentRelation,
+ pscan,
+ estate->es_snapshot);
+ /* disable syncscan in parallel tid range scan. */
+ pscan->phs_syncscan = false;
+ shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan);
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanReInitializeDSM
+ *
+ * Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanReInitializeDSM(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ ParallelTableScanDesc pscan;
+
+ pscan = node->ss.ss_currentScanDesc->rs_parallel;
+ table_parallelscan_reinitialize(node->ss.ss_currentRelation, pscan);
+
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanInitializeWorker
+ *
+ * Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeWorker(TidRangeScanState *node,
+ ParallelWorkerContext *pwcxt)
+{
+ ParallelTableScanDesc pscan;
+
+ pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan);
+}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 344a318831..fdb58d094f 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1367,7 +1367,8 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
Selectivity selectivity;
double pages;
Cost startup_cost = 0;
- Cost run_cost = 0;
+ Cost cpu_run_cost = 0;
+ Cost disk_run_cost = 0;
QualCost qpqual_cost;
Cost cpu_per_tuple;
QualCost tid_qual_cost;
@@ -1396,11 +1397,7 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
/*
* The first page in a range requires a random seek, but each subsequent
- * page is just a normal sequential page read. NOTE: it's desirable for
- * TID Range Scans to cost more than the equivalent Sequential Scans,
- * because Seq Scans have some performance advantages such as scan
- * synchronization and parallelizability, and we'd prefer one of them to
- * be picked unless a TID Range Scan really is better.
+ * page is just a normal sequential page read.
*/
ntuples = selectivity * baserel->tuples;
nseqpages = pages - 1.0;
@@ -1417,7 +1414,7 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
&spc_seq_page_cost);
/* disk costs; 1 random page and the remainder as seq pages */
- run_cost += spc_random_page_cost + spc_seq_page_cost * nseqpages;
+ disk_run_cost += spc_random_page_cost + spc_seq_page_cost * nseqpages;
/* Add scanning CPU costs */
get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
@@ -1425,24 +1422,39 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
/*
* XXX currently we assume TID quals are a subset of qpquals at this
* point; they will be removed (if possible) when we create the plan, so
- * we subtract their cost from the total qpqual cost. (If the TID quals
+ * we subtract their cost from the total qpqual cost. (If the TID quals
* can't be removed, this is a mistake and we're going to underestimate
* the CPU cost a bit.)
*/
startup_cost += qpqual_cost.startup + tid_qual_cost.per_tuple;
cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple -
tid_qual_cost.per_tuple;
- run_cost += cpu_per_tuple * ntuples;
+ cpu_run_cost += cpu_per_tuple * ntuples;
/* tlist eval costs are paid per output row, not per tuple scanned */
startup_cost += path->pathtarget->cost.startup;
- run_cost += path->pathtarget->cost.per_tuple * path->rows;
+ cpu_run_cost += path->pathtarget->cost.per_tuple * path->rows;
+
+ /* Adjust costing for parallelism, if used. */
+ if (path->parallel_workers > 0)
+ {
+ double parallel_divisor = get_parallel_divisor(path);
+
+ /* The CPU cost is divided among all the workers. */
+ cpu_run_cost /= parallel_divisor;
+
+ /*
+ * In the case of a parallel plan, the row count needs to represent
+ * the number of tuples processed per worker.
+ */
+ path->rows = clamp_row_est(path->rows / parallel_divisor);
+ }
/* we should not generate this path type when enable_tidscan=false */
Assert(enable_tidscan);
path->disabled_nodes = 0;
path->startup_cost = startup_cost;
- path->total_cost = startup_cost + run_cost;
+ path->total_cost = startup_cost + cpu_run_cost + disk_run_cost;
}
/*
diff --git a/src/backend/optimizer/path/tidpath.c b/src/backend/optimizer/path/tidpath.c
index 2bfb338b81..9c78eedcf5 100644
--- a/src/backend/optimizer/path/tidpath.c
+++ b/src/backend/optimizer/path/tidpath.c
@@ -47,6 +47,7 @@
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/restrictinfo.h"
+#include "optimizer/cost.h"
/*
@@ -553,7 +554,22 @@ create_tidscan_paths(PlannerInfo *root, RelOptInfo *rel)
add_path(rel, (Path *) create_tidrangescan_path(root, rel,
tidrangequals,
- required_outer));
+ required_outer,
+ 0));
+
+ /* If appropriate, consider parallel tid range scan. */
+ if (rel->consider_parallel && required_outer == NULL)
+ {
+ int parallel_workers;
+
+ parallel_workers = compute_parallel_worker(rel, rel->pages, -1,
+ max_parallel_workers_per_gather);
+ if (parallel_workers > 0)
+ {
+ add_partial_path(rel, (Path *) create_tidrangescan_path(root, rel, tidrangequals,
+ required_outer, parallel_workers));
+ }
+ }
}
/*
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index a4c5867cdc..ebfcc42551 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1262,7 +1262,8 @@ create_tidscan_path(PlannerInfo *root, RelOptInfo *rel, List *tidquals,
*/
TidRangePath *
create_tidrangescan_path(PlannerInfo *root, RelOptInfo *rel,
- List *tidrangequals, Relids required_outer)
+ List *tidrangequals, Relids required_outer,
+ int parallel_workers)
{
TidRangePath *pathnode = makeNode(TidRangePath);
@@ -1271,9 +1272,9 @@ create_tidrangescan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->path.pathtarget = rel->reltarget;
pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
required_outer);
- pathnode->path.parallel_aware = false;
+ pathnode->path.parallel_aware = (parallel_workers > 0);
pathnode->path.parallel_safe = rel->consider_parallel;
- pathnode->path.parallel_workers = 0;
+ pathnode->path.parallel_workers = parallel_workers;
pathnode->path.pathkeys = NIL; /* always unordered */
pathnode->tidrangequals = tidrangequals;
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b5e0fb386c..3da43557a1 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -96,6 +96,8 @@ typedef struct ParallelBlockTableScanDescData
BlockNumber phs_nblocks; /* # blocks in relation at start of scan */
slock_t phs_mutex; /* mutual exclusion for setting startblock */
BlockNumber phs_startblock; /* starting block number */
+ BlockNumber phs_numblock; /* # blocks to scan, or InvalidBlockNumber if
+ * no limit */
pg_atomic_uint64 phs_nallocated; /* number of blocks allocated to
* workers so far. */
} ParallelBlockTableScanDescData;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 1c9e802a6b..0f46a47c2e 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1125,6 +1125,16 @@ extern void table_parallelscan_initialize(Relation rel,
extern TableScanDesc table_beginscan_parallel(Relation relation,
ParallelTableScanDesc pscan);
+/*
+ * Begin a parallel tidrange scan. `pscan` needs to have been initialized with
+ * table_parallelscan_initialize(), for the same relation. The initialization
+ * does not need to have happened in this backend.
+ *
+ * Caller must hold a suitable lock on the relation.
+ */
+extern TableScanDesc table_beginscan_parallel_tidrange(Relation relation,
+ ParallelTableScanDesc pscan);
+
/*
* Restart a parallel scan. Call this in the leader process. Caller is
* responsible for making sure that all workers have finished the scan
diff --git a/src/include/executor/nodeTidrangescan.h b/src/include/executor/nodeTidrangescan.h
index a831f1202c..2b5465b3ce 100644
--- a/src/include/executor/nodeTidrangescan.h
+++ b/src/include/executor/nodeTidrangescan.h
@@ -14,6 +14,7 @@
#ifndef NODETIDRANGESCAN_H
#define NODETIDRANGESCAN_H
+#include "access/parallel.h"
#include "nodes/execnodes.h"
extern TidRangeScanState *ExecInitTidRangeScan(TidRangeScan *node,
@@ -21,4 +22,10 @@ extern TidRangeScanState *ExecInitTidRangeScan(TidRangeScan *node,
extern void ExecEndTidRangeScan(TidRangeScanState *node);
extern void ExecReScanTidRangeScan(TidRangeScanState *node);
+/* parallel scan support */
+extern void ExecTidRangeScanEstimate(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanReInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanInitializeWorker(TidRangeScanState *node, ParallelWorkerContext *pwcxt);
+
#endif /* NODETIDRANGESCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e107d6e5f8..958c78f66c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1929,6 +1929,7 @@ typedef struct TidScanState
* trss_mintid the lowest TID in the scan range
* trss_maxtid the highest TID in the scan range
* trss_inScan is a scan currently in progress?
+ * trss_pscanlen size of parallel TID range scan descriptor
* ----------------
*/
typedef struct TidRangeScanState
@@ -1938,6 +1939,7 @@ typedef struct TidRangeScanState
ItemPointerData trss_mintid;
ItemPointerData trss_maxtid;
bool trss_inScan;
+ Size trss_pscanlen;
} TidRangeScanState;
/* ----------------
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 58936e963c..cbfb98454c 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -67,7 +67,8 @@ extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
extern TidRangePath *create_tidrangescan_path(PlannerInfo *root,
RelOptInfo *rel,
List *tidrangequals,
- Relids required_outer);
+ Relids required_outer,
+ int parallel_workers);
extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
List *subpaths, List *partial_subpaths,
List *pathkeys, Relids required_outer,
diff --git a/src/test/regress/expected/tidrangescan.out b/src/test/regress/expected/tidrangescan.out
index 721f3b94e0..32cd2bd9f4 100644
--- a/src/test/regress/expected/tidrangescan.out
+++ b/src/test/regress/expected/tidrangescan.out
@@ -297,4 +297,110 @@ FETCH LAST c;
COMMIT;
DROP TABLE tidrangescan;
+-- tests for parallel tidrangescans
+SET parallel_setup_cost=0;
+SET parallel_tuple_cost=0;
+SET min_parallel_table_scan_size=0;
+SET max_parallel_workers_per_gather=4;
+CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor=10);
+-- insert enough tuples such that each page gets 5 tuples with fillfactor = 10
+INSERT INTO parallel_tidrangescan SELECT i,repeat('x', 100) FROM generate_series(1,200) AS s(i);
+-- ensure there are 40 pages for parallel test
+SELECT min(ctid), max(ctid) FROM parallel_tidrangescan;
+ min | max
+-------+--------
+ (0,1) | (39,5)
+(1 row)
+
+-- parallel range scans with upper bound
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
+ QUERY PLAN
+--------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: (ctid < '(30,1)'::tid)
+(6 rows)
+
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
+ count
+-------
+ 150
+(1 row)
+
+-- parallel range scans with lower bound
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
+ QUERY PLAN
+--------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: (ctid > '(10,0)'::tid)
+(6 rows)
+
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
+ count
+-------
+ 150
+(1 row)
+
+-- parallel range scans with both bounds
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)';
+ QUERY PLAN
+-----------------------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: ((ctid > '(10,0)'::tid) AND (ctid < '(30,1)'::tid))
+(6 rows)
+
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)';
+ count
+-------
+ 100
+(1 row)
+
+-- parallel rescans
+EXPLAIN (COSTS OFF)
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+ QUERY PLAN
+----------------------------------------------------------------
+ Nested Loop
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Tid Range Scan on parallel_tidrangescan t
+ TID Cond: (ctid < '(1,0)'::tid)
+ -> Aggregate
+ -> Tid Range Scan on parallel_tidrangescan t2
+ TID Cond: (ctid <= t.ctid)
+(8 rows)
+
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+ ctid | c
+-------+---
+ (0,1) | 1
+ (0,2) | 2
+ (0,3) | 3
+ (0,4) | 4
+ (0,5) | 5
+(5 rows)
+
+DROP TABLE parallel_tidrangescan;
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
RESET enable_seqscan;
diff --git a/src/test/regress/sql/tidrangescan.sql b/src/test/regress/sql/tidrangescan.sql
index ac09ebb626..1d18b8a61d 100644
--- a/src/test/regress/sql/tidrangescan.sql
+++ b/src/test/regress/sql/tidrangescan.sql
@@ -98,4 +98,49 @@ COMMIT;
DROP TABLE tidrangescan;
+-- tests for parallel tidrangescans
+SET parallel_setup_cost=0;
+SET parallel_tuple_cost=0;
+SET min_parallel_table_scan_size=0;
+SET max_parallel_workers_per_gather=4;
+
+CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor=10);
+
+-- insert enough tuples such that each page gets 5 tuples with fillfactor = 10
+INSERT INTO parallel_tidrangescan SELECT i,repeat('x', 100) FROM generate_series(1,200) AS s(i);
+
+-- ensure there are 40 pages for parallel test
+SELECT min(ctid), max(ctid) FROM parallel_tidrangescan;
+
+-- parallel range scans with upper bound
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid<'(30,1)';
+
+-- parallel range scans with lower bound
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)';
+
+-- parallel range scans with both bounds
+EXPLAIN (costs off)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid>'(10,0)' AND ctid<'(30,1)';
+
+-- parallel rescans
+EXPLAIN (COSTS OFF)
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+
+DROP TABLE parallel_tidrangescan;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
RESET enable_seqscan;
--
2.17.1
On Thu, 14 Aug 2025 at 10:03, Cary Huang <cary.huang@highgo.ca> wrote:
> ExecTidRangeScanInitializeWorker() is called by each parallel worker and is
> also updated such that it will not set the TID limits again.
This only works for setting the block range. What about the
TableScanDescData.rs_mintid and rs_maxtid? They'll be left unset in
the parallel worker, and heap_getnextslot_tidrange() needs to do
filtering based on those, which isn't going to work correctly when
they don't get set.
Here are the results from scanning a 10 million row table with the v9 patch:
# set parallel_setup_cost=0;
# set parallel_tuple_cost=0;
# select count(*) from huge where ctid >= '(10,10)' and ctid <= '(10000,10)';
count
--------
629175
# select count(*) from huge where ctid >= '(10,10)' and ctid <= '(10000,10)';
count
--------
600247
# select count(*) from huge where ctid >= '(10,10)' and ctid <= '(10000,10)';
count
--------
621943
(1 row)
The workers are ending their scan early because
heap_getnextslot_tidrange() returns false on the first call from the
parallel worker.
# set max_parallel_workers_per_gather=0;
# select count(*) from huge where ctid >= '(10,10)' and ctid <= '(10000,10)';
count
---------
2257741
David
> The workers are ending their scan early because
> heap_getnextslot_tidrange() returns false on the first call from the
> parallel worker.
Hi David, thank you for the testing!
Yes, the previous v9 patch missed setting node->trss_mintid and
node->trss_maxtid, causing the parallel workers to exit early due to
heap_getnextslot_tidrange() returning false.
With the attached v10 patch, the parallel leader and workers now
evaluate the given TID ranges (via TidRangeEval()) and set them in
ExecTidRangeScanInitializeDSM(), ExecTidRangeScanReInitializeDSM()
and ExecTidRangeScanInitializeWorker().
To prevent the leader and the workers from calling heap_setscanlimits()
and setting phs_startblock and phs_numblock concurrently in shared
memory, I added a condition so that only the parallel leader sets them.
Since node->trss_mintid and node->trss_maxtid reside in local memory,
the workers still have to call heap_set_tidrange() to have them set and
return correct scan results:
# SET parallel_setup_cost TO 0;
# SET parallel_tuple_cost TO 0;
# select count(*) from test where ctid >= '(10,10)' and ctid <= '(10000,10)';
count
---------
1848151
(1 row)
# SET max_parallel_workers_per_gather TO 0;
=# select count(*) from test where ctid >= '(10,10)' and ctid <= '(10000,10)';
count
---------
1848151
(1 row)
thank you again!
Cary
Attachments:
v10-0001-v10-parallel-tid-range-scan.patch (application/octet-stream)
From 5e0dddd1686304d4809102381565be7b3bc3a58f Mon Sep 17 00:00:00 2001
From: Cary Huang <cary.huang@highgo.ca>
Date: Fri, 29 Aug 2025 10:10:51 -0700
Subject: [PATCH v10] v10 parallel tid range scan
---
src/backend/access/heap/heapam.c | 10 ++
src/backend/access/table/tableam.c | 46 ++++++++-
src/backend/executor/execParallel.c | 21 ++++
src/backend/executor/nodeTidrangescan.c | 114 ++++++++++++++++++++-
src/backend/optimizer/path/costsize.c | 36 ++++---
src/backend/optimizer/path/tidpath.c | 20 +++-
src/backend/optimizer/util/pathnode.c | 7 +-
src/include/access/relscan.h | 2 +
src/include/access/tableam.h | 12 +++
src/include/executor/nodeTidrangescan.h | 7 ++
src/include/nodes/execnodes.h | 2 +
src/include/optimizer/pathnode.h | 3 +-
src/test/regress/expected/tidrangescan.out | 106 +++++++++++++++++++
src/test/regress/sql/tidrangescan.sql | 45 ++++++++
14 files changed, 410 insertions(+), 21 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0dcd6ee817..65abe09333 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -490,6 +490,16 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
scan->rs_startblock = startBlk;
scan->rs_numblocks = numBlks;
+
+ /* in a parallel scan, only the leader sets the limits in the shared descriptor */
+ if (scan->rs_base.rs_parallel != NULL && !IsParallelWorker())
+ {
+ ParallelBlockTableScanDesc bpscan;
+
+ bpscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
+ bpscan->phs_startblock = startBlk;
+ bpscan->phs_numblock = numBlks;
+ }
}
/*
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index a56c5eceb1..01ca264ba4 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -188,6 +188,42 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
pscan, flags);
}
+TableScanDesc
+table_beginscan_parallel_tidrange(Relation relation, ParallelTableScanDesc pscan,
+ ItemPointerData * mintid, ItemPointerData * maxtid)
+{
+ Snapshot snapshot;
+ uint32 flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
+ TableScanDesc sscan;
+
+ Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
+
+ /* disable syncscan in parallel tid range scan. */
+ pscan->phs_syncscan = false;
+
+ if (!pscan->phs_snapshot_any)
+ {
+ /* Snapshot was serialized -- restore it */
+ snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
+ RegisterSnapshot(snapshot);
+ flags |= SO_TEMP_SNAPSHOT;
+ }
+ else
+ {
+ /* SnapshotAny passed by caller (not serialized) */
+ snapshot = SnapshotAny;
+ }
+
+ sscan = relation->rd_tableam->scan_begin(relation, snapshot, 0, NULL,
+ pscan, flags);
+
+ /* Set the TID range if needed */
+ if (mintid && maxtid)
+ relation->rd_tableam->scan_set_tidrange(sscan, mintid, maxtid);
+
+ return sscan;
+}
+
/* ----------------------------------------------------------------------------
* Index scan related functions.
@@ -398,6 +434,7 @@ table_block_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan)
bpscan->phs_nblocks > NBuffers / 4;
SpinLockInit(&bpscan->phs_mutex);
bpscan->phs_startblock = InvalidBlockNumber;
+ bpscan->phs_numblock = InvalidBlockNumber;
pg_atomic_init_u64(&bpscan->phs_nallocated, 0);
return sizeof(ParallelBlockTableScanDescData);
@@ -577,8 +614,15 @@ table_block_parallelscan_nextpage(Relation rel,
pbscanwork->phsw_chunk_remaining = pbscanwork->phsw_chunk_size - 1;
}
+ /*
+ * Check if we've allocated every block in the relation, or if we've
+ * reached the limit imposed by pbscan->phs_numblock (if set).
+ */
if (nallocated >= pbscan->phs_nblocks)
- page = InvalidBlockNumber; /* all blocks have been allocated */
+ page = InvalidBlockNumber; /* all blocks have been allocated */
+ else if (pbscan->phs_numblock != InvalidBlockNumber &&
+ nallocated >= pbscan->phs_numblock)
+ page = InvalidBlockNumber; /* upper scan limit reached */
else
page = (nallocated + pbscan->phs_startblock) % pbscan->phs_nblocks;
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f098a5557c..7b1eb2e82c 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -41,6 +41,7 @@
#include "executor/nodeSort.h"
#include "executor/nodeSubplan.h"
#include "executor/tqueue.h"
+#include "executor/nodeTidrangescan.h"
#include "jit/jit.h"
#include "nodes/nodeFuncs.h"
#include "pgstat.h"
@@ -266,6 +267,11 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
ExecForeignScanEstimate((ForeignScanState *) planstate,
e->pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanEstimate((TidRangeScanState *) planstate,
+ e->pcxt);
+ break;
case T_AppendState:
if (planstate->plan->parallel_aware)
ExecAppendEstimate((AppendState *) planstate,
@@ -493,6 +499,11 @@ ExecParallelInitializeDSM(PlanState *planstate,
ExecForeignScanInitializeDSM((ForeignScanState *) planstate,
d->pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanInitializeDSM((TidRangeScanState *) planstate,
+ d->pcxt);
+ break;
case T_AppendState:
if (planstate->plan->parallel_aware)
ExecAppendInitializeDSM((AppendState *) planstate,
@@ -994,6 +1005,11 @@ ExecParallelReInitializeDSM(PlanState *planstate,
ExecForeignScanReInitializeDSM((ForeignScanState *) planstate,
pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanReInitializeDSM((TidRangeScanState *) planstate,
+ pcxt);
+ break;
case T_AppendState:
if (planstate->plan->parallel_aware)
ExecAppendReInitializeDSM((AppendState *) planstate, pcxt);
@@ -1362,6 +1378,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
ExecForeignScanInitializeWorker((ForeignScanState *) planstate,
pwcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanInitializeWorker((TidRangeScanState *) planstate,
+ pwcxt);
+ break;
case T_AppendState:
if (planstate->plan->parallel_aware)
ExecAppendInitializeWorker((AppendState *) planstate, pwcxt);
diff --git a/src/backend/executor/nodeTidrangescan.c b/src/backend/executor/nodeTidrangescan.c
index 26f7420b64..5e7357fe66 100644
--- a/src/backend/executor/nodeTidrangescan.c
+++ b/src/backend/executor/nodeTidrangescan.c
@@ -250,9 +250,13 @@ TidRangeNext(TidRangeScanState *node)
}
else
{
- /* rescan with the updated TID range */
- table_rescan_tidrange(scandesc, &node->trss_mintid,
- &node->trss_maxtid);
+ /* rescan with the updated TID range only in non-parallel mode */
+ if (scandesc->rs_parallel == NULL)
+ {
+ /* rescan with the updated TID range */
+ table_rescan_tidrange(scandesc, &node->trss_mintid,
+ &node->trss_maxtid);
+ }
}
node->trss_inScan = true;
@@ -405,3 +409,107 @@ ExecInitTidRangeScan(TidRangeScan *node, EState *estate, int eflags)
*/
return tidrangestate;
}
+/* ----------------------------------------------------------------
+ * Parallel Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanEstimate
+ *
+ * Compute the amount of space we'll need in the parallel
+ * query DSM, and inform pcxt->estimator about our needs.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanEstimate(TidRangeScanState *node, ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+
+ node->trss_pscanlen =
+ table_parallelscan_estimate(node->ss.ss_currentRelation,
+ estate->es_snapshot);
+ shm_toc_estimate_chunk(&pcxt->estimator, node->trss_pscanlen);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanInitializeDSM
+ *
+ * Set up a parallel TID scan descriptor.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+ ParallelTableScanDesc pscan;
+
+ pscan = shm_toc_allocate(pcxt->toc, node->trss_pscanlen);
+ table_parallelscan_initialize(node->ss.ss_currentRelation,
+ pscan,
+ estate->es_snapshot);
+ shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
+
+ /*
+ * Initialize parallel scan descriptor with given TID range if it can be
+ * evaluated successfully.
+ */
+ if (TidRangeEval(node))
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan,
+ &node->trss_mintid, &node->trss_maxtid);
+ else
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan,
+ NULL, NULL);
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanReInitializeDSM
+ *
+ * Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanReInitializeDSM(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ ParallelTableScanDesc pscan;
+
+ pscan = node->ss.ss_currentScanDesc->rs_parallel;
+ table_parallelscan_reinitialize(node->ss.ss_currentRelation, pscan);
+
+ /* Set the new TID range if it can be evaluated successfully */
+ if (TidRangeEval(node))
+ node->ss.ss_currentRelation->rd_tableam->scan_set_tidrange(
+ node->ss.ss_currentScanDesc, &node->trss_mintid,
+ &node->trss_maxtid);
+ else
+ node->ss.ss_currentRelation->rd_tableam->scan_set_tidrange(
+ node->ss.ss_currentScanDesc, NULL, NULL);
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanInitializeWorker
+ *
+ * Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeWorker(TidRangeScanState *node,
+ ParallelWorkerContext *pwcxt)
+{
+ ParallelTableScanDesc pscan;
+
+ pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
+
+ if (TidRangeEval(node))
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan,
+ &node->trss_mintid, &node->trss_maxtid);
+ else
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan,
+ NULL, NULL);
+}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 344a318831..eab1b18d30 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1366,8 +1366,9 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
{
Selectivity selectivity;
double pages;
- Cost startup_cost = 0;
- Cost run_cost = 0;
+ Cost startup_cost;
+ Cost cpu_run_cost;
+ Cost disk_run_cost;
QualCost qpqual_cost;
Cost cpu_per_tuple;
QualCost tid_qual_cost;
@@ -1396,11 +1397,7 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
/*
* The first page in a range requires a random seek, but each subsequent
- * page is just a normal sequential page read. NOTE: it's desirable for
- * TID Range Scans to cost more than the equivalent Sequential Scans,
- * because Seq Scans have some performance advantages such as scan
- * synchronization and parallelizability, and we'd prefer one of them to
- * be picked unless a TID Range Scan really is better.
+ * page is just a normal sequential page read.
*/
ntuples = selectivity * baserel->tuples;
nseqpages = pages - 1.0;
@@ -1417,7 +1414,7 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
&spc_seq_page_cost);
/* disk costs; 1 random page and the remainder as seq pages */
- run_cost += spc_random_page_cost + spc_seq_page_cost * nseqpages;
+ disk_run_cost = spc_random_page_cost + spc_seq_page_cost * nseqpages;
/* Add scanning CPU costs */
get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
@@ -1429,20 +1426,35 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
* can't be removed, this is a mistake and we're going to underestimate
* the CPU cost a bit.)
*/
- startup_cost += qpqual_cost.startup + tid_qual_cost.per_tuple;
+ startup_cost = qpqual_cost.startup + tid_qual_cost.per_tuple;
cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple -
tid_qual_cost.per_tuple;
- run_cost += cpu_per_tuple * ntuples;
+ cpu_run_cost = cpu_per_tuple * ntuples;
/* tlist eval costs are paid per output row, not per tuple scanned */
startup_cost += path->pathtarget->cost.startup;
- run_cost += path->pathtarget->cost.per_tuple * path->rows;
+ cpu_run_cost += path->pathtarget->cost.per_tuple * path->rows;
+
+ /* Adjust costing for parallelism, if used. */
+ if (path->parallel_workers > 0)
+ {
+ double parallel_divisor = get_parallel_divisor(path);
+
+ /* The CPU cost is divided among all the workers. */
+ cpu_run_cost /= parallel_divisor;
+
+ /*
+ * In the case of a parallel plan, the row count needs to represent
+ * the number of tuples processed per worker.
+ */
+ path->rows = clamp_row_est(path->rows / parallel_divisor);
+ }
/* we should not generate this path type when enable_tidscan=false */
Assert(enable_tidscan);
path->disabled_nodes = 0;
path->startup_cost = startup_cost;
- path->total_cost = startup_cost + run_cost;
+ path->total_cost = startup_cost + cpu_run_cost + disk_run_cost;
}
/*
diff --git a/src/backend/optimizer/path/tidpath.c b/src/backend/optimizer/path/tidpath.c
index 2bfb338b81..e48c85833e 100644
--- a/src/backend/optimizer/path/tidpath.c
+++ b/src/backend/optimizer/path/tidpath.c
@@ -47,6 +47,7 @@
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/restrictinfo.h"
+#include "optimizer/cost.h"
/*
@@ -553,7 +554,24 @@ create_tidscan_paths(PlannerInfo *root, RelOptInfo *rel)
add_path(rel, (Path *) create_tidrangescan_path(root, rel,
tidrangequals,
- required_outer));
+ required_outer,
+ 0));
+
+ /* If appropriate, consider parallel tid range scan. */
+ if (rel->consider_parallel && required_outer == NULL)
+ {
+ int parallel_workers;
+
+ parallel_workers = compute_parallel_worker(rel, rel->pages, -1,
+ max_parallel_workers_per_gather);
+
+ if (parallel_workers > 0)
+ add_partial_path(rel, (Path *) create_tidrangescan_path(root,
+ rel,
+ tidrangequals,
+ required_outer,
+ parallel_workers));
+ }
}
/*
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index a4c5867cdc..ebfcc42551 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1262,7 +1262,8 @@ create_tidscan_path(PlannerInfo *root, RelOptInfo *rel, List *tidquals,
*/
TidRangePath *
create_tidrangescan_path(PlannerInfo *root, RelOptInfo *rel,
- List *tidrangequals, Relids required_outer)
+ List *tidrangequals, Relids required_outer,
+ int parallel_workers)
{
TidRangePath *pathnode = makeNode(TidRangePath);
@@ -1271,9 +1272,9 @@ create_tidrangescan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->path.pathtarget = rel->reltarget;
pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
required_outer);
- pathnode->path.parallel_aware = false;
+ pathnode->path.parallel_aware = (parallel_workers > 0);
pathnode->path.parallel_safe = rel->consider_parallel;
- pathnode->path.parallel_workers = 0;
+ pathnode->path.parallel_workers = parallel_workers;
pathnode->path.pathkeys = NIL; /* always unordered */
pathnode->tidrangequals = tidrangequals;
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b5e0fb386c..3da43557a1 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -96,6 +96,8 @@ typedef struct ParallelBlockTableScanDescData
BlockNumber phs_nblocks; /* # blocks in relation at start of scan */
slock_t phs_mutex; /* mutual exclusion for setting startblock */
BlockNumber phs_startblock; /* starting block number */
+ BlockNumber phs_numblock; /* # blocks to scan, or InvalidBlockNumber if
+ * no limit */
pg_atomic_uint64 phs_nallocated; /* number of blocks allocated to
* workers so far. */
} ParallelBlockTableScanDescData;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 1c9e802a6b..99596d6258 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1125,6 +1125,18 @@ extern void table_parallelscan_initialize(Relation rel,
extern TableScanDesc table_beginscan_parallel(Relation relation,
ParallelTableScanDesc pscan);
+/*
+ * Begin a parallel tidrange scan. `pscan` needs to have been initialized with
+ * table_parallelscan_initialize(), for the same relation. The initialization
+ * does not need to have happened in this backend.
+ *
+ * Caller must hold a suitable lock on the relation.
+ */
+extern TableScanDesc table_beginscan_parallel_tidrange(Relation relation,
+ ParallelTableScanDesc pscan,
+ ItemPointerData * mintid,
+ ItemPointerData * maxtid);
+
/*
* Restart a parallel scan. Call this in the leader process. Caller is
* responsible for making sure that all workers have finished the scan
diff --git a/src/include/executor/nodeTidrangescan.h b/src/include/executor/nodeTidrangescan.h
index a831f1202c..2b5465b3ce 100644
--- a/src/include/executor/nodeTidrangescan.h
+++ b/src/include/executor/nodeTidrangescan.h
@@ -14,6 +14,7 @@
#ifndef NODETIDRANGESCAN_H
#define NODETIDRANGESCAN_H
+#include "access/parallel.h"
#include "nodes/execnodes.h"
extern TidRangeScanState *ExecInitTidRangeScan(TidRangeScan *node,
@@ -21,4 +22,10 @@ extern TidRangeScanState *ExecInitTidRangeScan(TidRangeScan *node,
extern void ExecEndTidRangeScan(TidRangeScanState *node);
extern void ExecReScanTidRangeScan(TidRangeScanState *node);
+/* parallel scan support */
+extern void ExecTidRangeScanEstimate(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanReInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanInitializeWorker(TidRangeScanState *node, ParallelWorkerContext *pwcxt);
+
#endif /* NODETIDRANGESCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e107d6e5f8..4947b6cca0 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1929,6 +1929,7 @@ typedef struct TidScanState
* trss_mintid the lowest TID in the scan range
* trss_maxtid the highest TID in the scan range
* trss_inScan is a scan currently in progress?
+ * trss_pscanlen size of parallel heap scan descriptor
* ----------------
*/
typedef struct TidRangeScanState
@@ -1938,6 +1939,7 @@ typedef struct TidRangeScanState
ItemPointerData trss_mintid;
ItemPointerData trss_maxtid;
bool trss_inScan;
+ Size trss_pscanlen;
} TidRangeScanState;
/* ----------------
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 58936e963c..cbfb98454c 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -67,7 +67,8 @@ extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
extern TidRangePath *create_tidrangescan_path(PlannerInfo *root,
RelOptInfo *rel,
List *tidrangequals,
- Relids required_outer);
+ Relids required_outer,
+ int parallel_workers);
extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
List *subpaths, List *partial_subpaths,
List *pathkeys, Relids required_outer,
diff --git a/src/test/regress/expected/tidrangescan.out b/src/test/regress/expected/tidrangescan.out
index 721f3b94e0..3c5fc9e102 100644
--- a/src/test/regress/expected/tidrangescan.out
+++ b/src/test/regress/expected/tidrangescan.out
@@ -297,4 +297,110 @@ FETCH LAST c;
COMMIT;
DROP TABLE tidrangescan;
+-- tests for parallel tidrangescans
+SET parallel_setup_cost TO 0;
+SET parallel_tuple_cost TO 0;
+SET min_parallel_table_scan_size TO 0;
+SET max_parallel_workers_per_gather TO 4;
+CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor = 10);
+-- insert enough tuples such that each page gets 5 tuples with fillfactor = 10
+INSERT INTO parallel_tidrangescan SELECT i, repeat('x', 100) FROM generate_series(1,200) AS s(i);
+-- ensure there are 40 pages for parallel test
+SELECT min(ctid), max(ctid) FROM parallel_tidrangescan;
+ min | max
+-------+--------
+ (0,1) | (39,5)
+(1 row)
+
+-- parallel range scans with upper bound
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
+ QUERY PLAN
+--------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: (ctid < '(30,1)'::tid)
+(6 rows)
+
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
+ count
+-------
+ 150
+(1 row)
+
+-- parallel range scans with lower bound
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
+ QUERY PLAN
+--------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: (ctid > '(10,0)'::tid)
+(6 rows)
+
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
+ count
+-------
+ 150
+(1 row)
+
+-- parallel range scans with both bounds
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
+ QUERY PLAN
+-----------------------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: ((ctid > '(10,0)'::tid) AND (ctid < '(30,1)'::tid))
+(6 rows)
+
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
+ count
+-------
+ 100
+(1 row)
+
+-- parallel rescans
+EXPLAIN (COSTS OFF)
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+ QUERY PLAN
+----------------------------------------------------------------
+ Nested Loop
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Tid Range Scan on parallel_tidrangescan t
+ TID Cond: (ctid < '(1,0)'::tid)
+ -> Aggregate
+ -> Tid Range Scan on parallel_tidrangescan t2
+ TID Cond: (ctid <= t.ctid)
+(8 rows)
+
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+ ctid | c
+-------+---
+ (0,1) | 1
+ (0,2) | 2
+ (0,3) | 3
+ (0,4) | 4
+ (0,5) | 5
+(5 rows)
+
+DROP TABLE parallel_tidrangescan;
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
RESET enable_seqscan;
diff --git a/src/test/regress/sql/tidrangescan.sql b/src/test/regress/sql/tidrangescan.sql
index ac09ebb626..0f1e43c6d0 100644
--- a/src/test/regress/sql/tidrangescan.sql
+++ b/src/test/regress/sql/tidrangescan.sql
@@ -98,4 +98,49 @@ COMMIT;
DROP TABLE tidrangescan;
+-- tests for parallel tidrangescans
+SET parallel_setup_cost TO 0;
+SET parallel_tuple_cost TO 0;
+SET min_parallel_table_scan_size TO 0;
+SET max_parallel_workers_per_gather TO 4;
+
+CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor = 10);
+
+-- insert enough tuples such that each page gets 5 tuples with fillfactor = 10
+INSERT INTO parallel_tidrangescan SELECT i, repeat('x', 100) FROM generate_series(1,200) AS s(i);
+
+-- ensure there are 40 pages for parallel test
+SELECT min(ctid), max(ctid) FROM parallel_tidrangescan;
+
+-- parallel range scans with upper bound
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
+
+-- parallel range scans with lower bound
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
+
+-- parallel range scans with both bounds
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
+
+-- parallel rescans
+EXPLAIN (COSTS OFF)
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+
+DROP TABLE parallel_tidrangescan;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
RESET enable_seqscan;
--
2.17.1
On Sat, 30 Aug 2025 at 05:32, Cary Huang <cary.huang@highgo.ca> wrote:
The workers are ending their scan early because
heap_getnextslot_tidrange() returns false on the first call from the
parallel worker.
Yes, the previous v9 patch missed setting node->trss_mintid and
node->trss_maxtid, causing the parallel workers to exit early due to
heap_getnextslot_tidrange() returning false.

With the attached v10 patch, the parallel leader and workers now
have to evaluate (TidRangeEval()) the given TID ranges and set them
via ExecTidRangeScanInitializeDSM(),
ExecTidRangeScanReInitializeDSM() and
ExecTidRangeScanInitializeWorker().

To prevent the leader and the workers from calling heap_setscanlimits()
and trying to set phs_startblock and phs_numblock concurrently in
shared memory, I added a condition that only allows the parallel
leader to set them. Since node->trss_mintid and node->trss_maxtid
reside in local memory, the workers still have to call
heap_set_tidrange() to have them set in order to return correct scan
results:
I spent quite a bit of time looking at this. I didn't like the way
heap_setscanlimits() did:
+ /* set the limits in the ParallelBlockTableScanDesc, when present as leader */
+ if (scan->rs_base.rs_parallel != NULL && !IsParallelWorker())
as it wasn't clear to me that this didn't break completely when the
leader didn't make it there first.
I've made quite a few revisions to the v10 patch, which I've attached
as v11-0002. v11-0001 is your v10 rebased atop of master.
Here's a summary of the changes:
1. Moved block limiting logic for parallel scans into
table_block_parallelscan_startblock_init(). There's currently a lock
here to ensure only 1 worker can set the shared memory fields at a
time. I've hooked into the same lock to set the startblock and
numblocks.
2. Fixed chunk size ramp-down code which is meant to divvy up the scan
into smaller and smaller chunks as it nears completion so that one
worker doesn't get left with too much work and leave the others with
nothing. That code still thought that it was scanning every block in
the table.
3. Changed things around so that the min/max TID for the parallel scan
is specified via table_rescan_tidrange(). This means zero changes to
TidRangeNext, and the only additions to nodeTidrangescan.c are for the
shared memory handling.
4. The rest of the changes are mostly cosmetic.
With this version table_block_parallelscan_startblock_init() has grown
2 extra parameters. I considered whether we should instead rename this
function to append "_with_limit" to its name, then add another function
with the original name that calls the renamed function passing
InvalidBlockNumber for both new parameters. I didn't do that, as we
only have a single call to the existing function, so doing that would
only be for the benefit of extensions that happen to use that function,
and it doesn't seem overly difficult for them to adjust their code. I
didn't find any code using that function on codesearch.debian.net.
I still need to do a bit more testing on this, but in the meantime
thought I'd share what I've done with it so that other people can look
in parallel.
David
Attachments:
v11-0001-v10-parallel-tid-range-scan.patch (application/octet-stream)
From 1fce169e29a5cfa380b92c8239bf3fce2977acbc Mon Sep 17 00:00:00 2001
From: Cary Huang <cary.huang@highgo.ca>
Date: Fri, 29 Aug 2025 10:10:51 -0700
Subject: [PATCH v11 1/2] v10 parallel tid range scan
---
src/backend/access/heap/heapam.c | 10 ++
src/backend/access/table/tableam.c | 46 ++++++++-
src/backend/executor/execParallel.c | 21 ++++
src/backend/executor/nodeTidrangescan.c | 114 ++++++++++++++++++++-
src/backend/optimizer/path/costsize.c | 36 ++++---
src/backend/optimizer/path/tidpath.c | 20 +++-
src/backend/optimizer/util/pathnode.c | 7 +-
src/include/access/relscan.h | 2 +
src/include/access/tableam.h | 12 +++
src/include/executor/nodeTidrangescan.h | 7 ++
src/include/nodes/execnodes.h | 2 +
src/include/optimizer/pathnode.h | 3 +-
src/test/regress/expected/tidrangescan.out | 106 +++++++++++++++++++
src/test/regress/sql/tidrangescan.sql | 45 ++++++++
14 files changed, 410 insertions(+), 21 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 36fee9c994e..f1693e79c31 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -490,6 +490,16 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
scan->rs_startblock = startBlk;
scan->rs_numblocks = numBlks;
+
+ /* set the limits in the ParallelBlockTableScanDesc, when present as leader */
+ if (scan->rs_base.rs_parallel != NULL && !IsParallelWorker())
+ {
+ ParallelBlockTableScanDesc bpscan;
+
+ bpscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
+ bpscan->phs_startblock = startBlk;
+ bpscan->phs_numblock = numBlks;
+ }
}
/*
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 5e41404937e..baef7459b6b 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -188,6 +188,42 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
pscan, flags);
}
+TableScanDesc
+table_beginscan_parallel_tidrange(Relation relation, ParallelTableScanDesc pscan,
+ ItemPointerData * mintid, ItemPointerData * maxtid)
+{
+ Snapshot snapshot;
+ uint32 flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
+ TableScanDesc sscan;
+
+ Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
+
+ /* disable syncscan in parallel tid range scan. */
+ pscan->phs_syncscan = false;
+
+ if (!pscan->phs_snapshot_any)
+ {
+ /* Snapshot was serialized -- restore it */
+ snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
+ RegisterSnapshot(snapshot);
+ flags |= SO_TEMP_SNAPSHOT;
+ }
+ else
+ {
+ /* SnapshotAny passed by caller (not serialized) */
+ snapshot = SnapshotAny;
+ }
+
+ sscan = relation->rd_tableam->scan_begin(relation, snapshot, 0, NULL,
+ pscan, flags);
+
+ /* Set the TID range if needed */
+ if (mintid && maxtid)
+ relation->rd_tableam->scan_set_tidrange(sscan, mintid, maxtid);
+
+ return sscan;
+}
+
/* ----------------------------------------------------------------------------
* Index scan related functions.
@@ -398,6 +434,7 @@ table_block_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan)
bpscan->phs_nblocks > NBuffers / 4;
SpinLockInit(&bpscan->phs_mutex);
bpscan->phs_startblock = InvalidBlockNumber;
+ bpscan->phs_numblock = InvalidBlockNumber;
pg_atomic_init_u64(&bpscan->phs_nallocated, 0);
return sizeof(ParallelBlockTableScanDescData);
@@ -577,8 +614,15 @@ table_block_parallelscan_nextpage(Relation rel,
pbscanwork->phsw_chunk_remaining = pbscanwork->phsw_chunk_size - 1;
}
+ /*
+ * Check if we've allocated every block in the relation, or if we've
+ * reached the limit imposed by pbscan->phs_numblock (if set).
+ */
if (nallocated >= pbscan->phs_nblocks)
- page = InvalidBlockNumber; /* all blocks have been allocated */
+ page = InvalidBlockNumber; /* all blocks have been allocated */
+ else if (pbscan->phs_numblock != InvalidBlockNumber &&
+ nallocated >= pbscan->phs_numblock)
+ page = InvalidBlockNumber; /* upper scan limit reached */
else
page = (nallocated + pbscan->phs_startblock) % pbscan->phs_nblocks;
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f098a5557cf..7b1eb2e82c7 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -41,6 +41,7 @@
#include "executor/nodeSort.h"
#include "executor/nodeSubplan.h"
#include "executor/tqueue.h"
+#include "executor/nodeTidrangescan.h"
#include "jit/jit.h"
#include "nodes/nodeFuncs.h"
#include "pgstat.h"
@@ -266,6 +267,11 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
ExecForeignScanEstimate((ForeignScanState *) planstate,
e->pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanEstimate((TidRangeScanState *) planstate,
+ e->pcxt);
+ break;
case T_AppendState:
if (planstate->plan->parallel_aware)
ExecAppendEstimate((AppendState *) planstate,
@@ -493,6 +499,11 @@ ExecParallelInitializeDSM(PlanState *planstate,
ExecForeignScanInitializeDSM((ForeignScanState *) planstate,
d->pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanInitializeDSM((TidRangeScanState *) planstate,
+ d->pcxt);
+ break;
case T_AppendState:
if (planstate->plan->parallel_aware)
ExecAppendInitializeDSM((AppendState *) planstate,
@@ -994,6 +1005,11 @@ ExecParallelReInitializeDSM(PlanState *planstate,
ExecForeignScanReInitializeDSM((ForeignScanState *) planstate,
pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanReInitializeDSM((TidRangeScanState *) planstate,
+ pcxt);
+ break;
case T_AppendState:
if (planstate->plan->parallel_aware)
ExecAppendReInitializeDSM((AppendState *) planstate, pcxt);
@@ -1362,6 +1378,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
ExecForeignScanInitializeWorker((ForeignScanState *) planstate,
pwcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanInitializeWorker((TidRangeScanState *) planstate,
+ pwcxt);
+ break;
case T_AppendState:
if (planstate->plan->parallel_aware)
ExecAppendInitializeWorker((AppendState *) planstate, pwcxt);
diff --git a/src/backend/executor/nodeTidrangescan.c b/src/backend/executor/nodeTidrangescan.c
index 1bce8d6cbfe..39088755e90 100644
--- a/src/backend/executor/nodeTidrangescan.c
+++ b/src/backend/executor/nodeTidrangescan.c
@@ -250,9 +250,13 @@ TidRangeNext(TidRangeScanState *node)
}
else
{
- /* rescan with the updated TID range */
- table_rescan_tidrange(scandesc, &node->trss_mintid,
- &node->trss_maxtid);
+ /* rescan with the updated TID range only in non-parallel mode */
+ if (scandesc->rs_parallel == NULL)
+ {
+ /* rescan with the updated TID range */
+ table_rescan_tidrange(scandesc, &node->trss_mintid,
+ &node->trss_maxtid);
+ }
}
node->trss_inScan = true;
@@ -415,3 +419,107 @@ ExecInitTidRangeScan(TidRangeScan *node, EState *estate, int eflags)
*/
return tidrangestate;
}
+/* ----------------------------------------------------------------
+ * Parallel Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanEstimate
+ *
+ * Compute the amount of space we'll need in the parallel
+ * query DSM, and inform pcxt->estimator about our needs.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanEstimate(TidRangeScanState *node, ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+
+ node->trss_pscanlen =
+ table_parallelscan_estimate(node->ss.ss_currentRelation,
+ estate->es_snapshot);
+ shm_toc_estimate_chunk(&pcxt->estimator, node->trss_pscanlen);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanInitializeDSM
+ *
+ * Set up a parallel TID scan descriptor.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+ ParallelTableScanDesc pscan;
+
+ pscan = shm_toc_allocate(pcxt->toc, node->trss_pscanlen);
+ table_parallelscan_initialize(node->ss.ss_currentRelation,
+ pscan,
+ estate->es_snapshot);
+ shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
+
+ /*
+ * Initialize parallel scan descriptor with given TID range if it can be
+ * evaluated successfully.
+ */
+ if (TidRangeEval(node))
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan,
+ &node->trss_mintid, &node->trss_maxtid);
+ else
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan,
+ NULL, NULL);
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanReInitializeDSM
+ *
+ * Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanReInitializeDSM(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ ParallelTableScanDesc pscan;
+
+ pscan = node->ss.ss_currentScanDesc->rs_parallel;
+ table_parallelscan_reinitialize(node->ss.ss_currentRelation, pscan);
+
+ /* Set the new TID range if it can be evaluated successfully */
+ if (TidRangeEval(node))
+ node->ss.ss_currentRelation->rd_tableam->scan_set_tidrange(
+ node->ss.ss_currentScanDesc, &node->trss_mintid,
+ &node->trss_maxtid);
+ else
+ node->ss.ss_currentRelation->rd_tableam->scan_set_tidrange(
+ node->ss.ss_currentScanDesc, NULL, NULL);
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanInitializeWorker
+ *
+ * Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeWorker(TidRangeScanState *node,
+ ParallelWorkerContext *pwcxt)
+{
+ ParallelTableScanDesc pscan;
+
+ pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
+
+ if (TidRangeEval(node))
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan,
+ &node->trss_mintid, &node->trss_maxtid);
+ else
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan,
+ NULL, NULL);
+}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8335cf5b5c5..01976226d19 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1340,8 +1340,9 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
{
Selectivity selectivity;
double pages;
- Cost startup_cost = 0;
- Cost run_cost = 0;
+ Cost startup_cost;
+ Cost cpu_run_cost;
+ Cost disk_run_cost;
QualCost qpqual_cost;
Cost cpu_per_tuple;
QualCost tid_qual_cost;
@@ -1370,11 +1371,7 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
/*
* The first page in a range requires a random seek, but each subsequent
- * page is just a normal sequential page read. NOTE: it's desirable for
- * TID Range Scans to cost more than the equivalent Sequential Scans,
- * because Seq Scans have some performance advantages such as scan
- * synchronization and parallelizability, and we'd prefer one of them to
- * be picked unless a TID Range Scan really is better.
+ * page is just a normal sequential page read.
*/
ntuples = selectivity * baserel->tuples;
nseqpages = pages - 1.0;
@@ -1391,7 +1388,7 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
&spc_seq_page_cost);
/* disk costs; 1 random page and the remainder as seq pages */
- run_cost += spc_random_page_cost + spc_seq_page_cost * nseqpages;
+ disk_run_cost = spc_random_page_cost + spc_seq_page_cost * nseqpages;
/* Add scanning CPU costs */
get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
@@ -1403,20 +1400,35 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
* can't be removed, this is a mistake and we're going to underestimate
* the CPU cost a bit.)
*/
- startup_cost += qpqual_cost.startup + tid_qual_cost.per_tuple;
+ startup_cost = qpqual_cost.startup + tid_qual_cost.per_tuple;
cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple -
tid_qual_cost.per_tuple;
- run_cost += cpu_per_tuple * ntuples;
+ cpu_run_cost = cpu_per_tuple * ntuples;
/* tlist eval costs are paid per output row, not per tuple scanned */
startup_cost += path->pathtarget->cost.startup;
- run_cost += path->pathtarget->cost.per_tuple * path->rows;
+ cpu_run_cost += path->pathtarget->cost.per_tuple * path->rows;
+
+ /* Adjust costing for parallelism, if used. */
+ if (path->parallel_workers > 0)
+ {
+ double parallel_divisor = get_parallel_divisor(path);
+
+ /* The CPU cost is divided among all the workers. */
+ cpu_run_cost /= parallel_divisor;
+
+ /*
+ * In the case of a parallel plan, the row count needs to represent
+ * the number of tuples processed per worker.
+ */
+ path->rows = clamp_row_est(path->rows / parallel_divisor);
+ }
/* we should not generate this path type when enable_tidscan=false */
Assert(enable_tidscan);
path->disabled_nodes = 0;
path->startup_cost = startup_cost;
- path->total_cost = startup_cost + run_cost;
+ path->total_cost = startup_cost + cpu_run_cost + disk_run_cost;
}
/*
diff --git a/src/backend/optimizer/path/tidpath.c b/src/backend/optimizer/path/tidpath.c
index 2bfb338b81c..e48c85833e7 100644
--- a/src/backend/optimizer/path/tidpath.c
+++ b/src/backend/optimizer/path/tidpath.c
@@ -47,6 +47,7 @@
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/restrictinfo.h"
+#include "optimizer/cost.h"
/*
@@ -553,7 +554,24 @@ create_tidscan_paths(PlannerInfo *root, RelOptInfo *rel)
add_path(rel, (Path *) create_tidrangescan_path(root, rel,
tidrangequals,
- required_outer));
+ required_outer,
+ 0));
+
+ /* If appropriate, consider parallel tid range scan. */
+ if (rel->consider_parallel && required_outer == NULL)
+ {
+ int parallel_workers;
+
+ parallel_workers = compute_parallel_worker(rel, rel->pages, -1,
+ max_parallel_workers_per_gather);
+
+ if (parallel_workers > 0)
+ add_partial_path(rel, (Path *) create_tidrangescan_path(root,
+ rel,
+ tidrangequals,
+ required_outer,
+ parallel_workers));
+ }
}
/*
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index e4fd6950fad..fd4bd5f93f0 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1262,7 +1262,8 @@ create_tidscan_path(PlannerInfo *root, RelOptInfo *rel, List *tidquals,
*/
TidRangePath *
create_tidrangescan_path(PlannerInfo *root, RelOptInfo *rel,
- List *tidrangequals, Relids required_outer)
+ List *tidrangequals, Relids required_outer,
+ int parallel_workers)
{
TidRangePath *pathnode = makeNode(TidRangePath);
@@ -1271,9 +1272,9 @@ create_tidrangescan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->path.pathtarget = rel->reltarget;
pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
required_outer);
- pathnode->path.parallel_aware = false;
+ pathnode->path.parallel_aware = (parallel_workers > 0);
pathnode->path.parallel_safe = rel->consider_parallel;
- pathnode->path.parallel_workers = 0;
+ pathnode->path.parallel_workers = parallel_workers;
pathnode->path.pathkeys = NIL; /* always unordered */
pathnode->tidrangequals = tidrangequals;
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b5e0fb386c0..3da43557a13 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -96,6 +96,8 @@ typedef struct ParallelBlockTableScanDescData
BlockNumber phs_nblocks; /* # blocks in relation at start of scan */
slock_t phs_mutex; /* mutual exclusion for setting startblock */
BlockNumber phs_startblock; /* starting block number */
+ BlockNumber phs_numblock; /* # blocks to scan, or InvalidBlockNumber if
+ * no limit */
pg_atomic_uint64 phs_nallocated; /* number of blocks allocated to
* workers so far. */
} ParallelBlockTableScanDescData;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index e16bf025692..8e97fc5f0be 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1130,6 +1130,18 @@ extern void table_parallelscan_initialize(Relation rel,
extern TableScanDesc table_beginscan_parallel(Relation relation,
ParallelTableScanDesc pscan);
+/*
+ * Begin a parallel tidrange scan. `pscan` needs to have been initialized with
+ * table_parallelscan_initialize(), for the same relation. The initialization
+ * does not need to have happened in this backend.
+ *
+ * Caller must hold a suitable lock on the relation.
+ */
+extern TableScanDesc table_beginscan_parallel_tidrange(Relation relation,
+ ParallelTableScanDesc pscan,
+ ItemPointerData * mintid,
+ ItemPointerData * maxtid);
+
/*
* Restart a parallel scan. Call this in the leader process. Caller is
* responsible for making sure that all workers have finished the scan
diff --git a/src/include/executor/nodeTidrangescan.h b/src/include/executor/nodeTidrangescan.h
index a831f1202ca..2b5465b3ce4 100644
--- a/src/include/executor/nodeTidrangescan.h
+++ b/src/include/executor/nodeTidrangescan.h
@@ -14,6 +14,7 @@
#ifndef NODETIDRANGESCAN_H
#define NODETIDRANGESCAN_H
+#include "access/parallel.h"
#include "nodes/execnodes.h"
extern TidRangeScanState *ExecInitTidRangeScan(TidRangeScan *node,
@@ -21,4 +22,10 @@ extern TidRangeScanState *ExecInitTidRangeScan(TidRangeScan *node,
extern void ExecEndTidRangeScan(TidRangeScanState *node);
extern void ExecReScanTidRangeScan(TidRangeScanState *node);
+/* parallel scan support */
+extern void ExecTidRangeScanEstimate(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanReInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanInitializeWorker(TidRangeScanState *node, ParallelWorkerContext *pwcxt);
+
#endif /* NODETIDRANGESCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 18ae8f0d4bb..64ff6996431 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1930,6 +1930,7 @@ typedef struct TidScanState
* trss_mintid the lowest TID in the scan range
* trss_maxtid the highest TID in the scan range
* trss_inScan is a scan currently in progress?
+ * trss_pscanlen size of parallel heap scan descriptor
* ----------------
*/
typedef struct TidRangeScanState
@@ -1939,6 +1940,7 @@ typedef struct TidRangeScanState
ItemPointerData trss_mintid;
ItemPointerData trss_maxtid;
bool trss_inScan;
+ Size trss_pscanlen;
} TidRangeScanState;
/* ----------------
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 955e9056858..6b010f0b1a5 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -67,7 +67,8 @@ extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
extern TidRangePath *create_tidrangescan_path(PlannerInfo *root,
RelOptInfo *rel,
List *tidrangequals,
- Relids required_outer);
+ Relids required_outer,
+ int parallel_workers);
extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
List *subpaths, List *partial_subpaths,
List *pathkeys, Relids required_outer,
diff --git a/src/test/regress/expected/tidrangescan.out b/src/test/regress/expected/tidrangescan.out
index 721f3b94e04..3c5fc9e102a 100644
--- a/src/test/regress/expected/tidrangescan.out
+++ b/src/test/regress/expected/tidrangescan.out
@@ -297,4 +297,110 @@ FETCH LAST c;
COMMIT;
DROP TABLE tidrangescan;
+-- tests for parallel tidrangescans
+SET parallel_setup_cost TO 0;
+SET parallel_tuple_cost TO 0;
+SET min_parallel_table_scan_size TO 0;
+SET max_parallel_workers_per_gather TO 4;
+CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor = 10);
+-- insert enough tuples such that each page gets 5 tuples with fillfactor = 10
+INSERT INTO parallel_tidrangescan SELECT i, repeat('x', 100) FROM generate_series(1,200) AS s(i);
+-- ensure there are 40 pages for parallel test
+SELECT min(ctid), max(ctid) FROM parallel_tidrangescan;
+ min | max
+-------+--------
+ (0,1) | (39,5)
+(1 row)
+
+-- parallel range scans with upper bound
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
+ QUERY PLAN
+--------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: (ctid < '(30,1)'::tid)
+(6 rows)
+
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
+ count
+-------
+ 150
+(1 row)
+
+-- parallel range scans with lower bound
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
+ QUERY PLAN
+--------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: (ctid > '(10,0)'::tid)
+(6 rows)
+
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
+ count
+-------
+ 150
+(1 row)
+
+-- parallel range scans with both bounds
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
+ QUERY PLAN
+-----------------------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: ((ctid > '(10,0)'::tid) AND (ctid < '(30,1)'::tid))
+(6 rows)
+
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
+ count
+-------
+ 100
+(1 row)
+
+-- parallel rescans
+EXPLAIN (COSTS OFF)
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+ QUERY PLAN
+----------------------------------------------------------------
+ Nested Loop
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Tid Range Scan on parallel_tidrangescan t
+ TID Cond: (ctid < '(1,0)'::tid)
+ -> Aggregate
+ -> Tid Range Scan on parallel_tidrangescan t2
+ TID Cond: (ctid <= t.ctid)
+(8 rows)
+
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+ ctid | c
+-------+---
+ (0,1) | 1
+ (0,2) | 2
+ (0,3) | 3
+ (0,4) | 4
+ (0,5) | 5
+(5 rows)
+
+DROP TABLE parallel_tidrangescan;
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
RESET enable_seqscan;
diff --git a/src/test/regress/sql/tidrangescan.sql b/src/test/regress/sql/tidrangescan.sql
index ac09ebb6262..0f1e43c6d05 100644
--- a/src/test/regress/sql/tidrangescan.sql
+++ b/src/test/regress/sql/tidrangescan.sql
@@ -98,4 +98,49 @@ COMMIT;
DROP TABLE tidrangescan;
+-- tests for parallel tidrangescans
+SET parallel_setup_cost TO 0;
+SET parallel_tuple_cost TO 0;
+SET min_parallel_table_scan_size TO 0;
+SET max_parallel_workers_per_gather TO 4;
+
+CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor = 10);
+
+-- insert enough tuples such that each page gets 5 tuples with fillfactor = 10
+INSERT INTO parallel_tidrangescan SELECT i, repeat('x', 100) FROM generate_series(1,200) AS s(i);
+
+-- ensure there are 40 pages for parallel test
+SELECT min(ctid), max(ctid) FROM parallel_tidrangescan;
+
+-- parallel range scans with upper bound
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
+
+-- parallel range scans with lower bound
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
+
+-- parallel range scans with both bounds
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
+
+-- parallel rescans
+EXPLAIN (COSTS OFF)
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+
+DROP TABLE parallel_tidrangescan;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
RESET enable_seqscan;
--
2.43.0
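For anyone wanting to try this out quickly: assuming a server built with the patch applied, a psql session along the following lines should show the new plan node being chosen (the table shape and GUC settings here just mirror the regression test so that parallel plans are not priced out; the table name `t` is only illustrative):

```sql
-- Keep parallel plans cheap enough for the planner to consider them.
SET parallel_setup_cost TO 0;
SET parallel_tuple_cost TO 0;
SET min_parallel_table_scan_size TO 0;
SET max_parallel_workers_per_gather TO 4;

-- Low fillfactor spreads 200 rows over ~40 pages, enough for workers.
CREATE TABLE t (id integer, data text) WITH (fillfactor = 10);
INSERT INTO t SELECT i, repeat('x', 100) FROM generate_series(1, 200) i;

-- With the patch, this should show "Parallel Tid Range Scan" under Gather.
EXPLAIN (COSTS OFF)
SELECT count(*) FROM t WHERE ctid < '(30,1)';
```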
Attachment: v11-0002-fixup-v10-parallel-tid-range-scan.patch (application/octet-stream)
From 76ee9f2037080a58e53ad2b7fd5cf9d0f86e15e9 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Fri, 7 Nov 2025 18:03:09 +1300
Subject: [PATCH v11 2/2] fixup! v10 parallel tid range scan
---
src/backend/access/heap/heapam.c | 14 +--
src/backend/access/table/tableam.c | 134 ++++++++++++---------
src/backend/executor/execParallel.c | 6 +-
src/backend/executor/nodeTidrangescan.c | 52 ++------
src/backend/optimizer/path/costsize.c | 6 +-
src/backend/optimizer/path/tidpath.c | 6 +-
src/include/access/tableam.h | 8 +-
src/test/regress/expected/tidrangescan.out | 29 ++---
src/test/regress/sql/tidrangescan.sql | 31 +++--
9 files changed, 136 insertions(+), 150 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f1693e79c31..1ad442c1b2c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -258,7 +258,9 @@ heap_scan_stream_read_next_parallel(ReadStream *stream,
/* parallel scan */
table_block_parallelscan_startblock_init(scan->rs_base.rs_rd,
scan->rs_parallelworkerdata,
- (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel);
+ (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel,
+ scan->rs_startblock,
+ scan->rs_numblocks);
/* may return InvalidBlockNumber if there are no more blocks */
scan->rs_prefetch_block = table_block_parallelscan_nextpage(scan->rs_base.rs_rd,
@@ -490,16 +492,6 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
scan->rs_startblock = startBlk;
scan->rs_numblocks = numBlks;
-
- /* set the limits in the ParallelBlockTableScanDesc, when present as leader */
- if (scan->rs_base.rs_parallel != NULL && !IsParallelWorker())
- {
- ParallelBlockTableScanDesc bpscan;
-
- bpscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
- bpscan->phs_startblock = startBlk;
- bpscan->phs_numblock = numBlks;
- }
}
/*
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index baef7459b6b..9c3347ba12b 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -189,8 +189,8 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
}
TableScanDesc
-table_beginscan_parallel_tidrange(Relation relation, ParallelTableScanDesc pscan,
- ItemPointerData * mintid, ItemPointerData * maxtid)
+table_beginscan_parallel_tidrange(Relation relation,
+ ParallelTableScanDesc pscan)
{
Snapshot snapshot;
uint32 flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
@@ -216,11 +216,6 @@ table_beginscan_parallel_tidrange(Relation relation, ParallelTableScanDesc pscan
sscan = relation->rd_tableam->scan_begin(relation, snapshot, 0, NULL,
pscan, flags);
-
- /* Set the TID range if needed */
- if (mintid && maxtid)
- relation->rd_tableam->scan_set_tidrange(sscan, mintid, maxtid);
-
return sscan;
}
@@ -453,14 +448,22 @@ table_block_parallelscan_reinitialize(Relation rel, ParallelTableScanDesc pscan)
*
* Determine where the parallel seq scan should start. This function may be
* called many times, once by each parallel worker. We must be careful only
- * to set the startblock once.
+ * to set the phs_startblock and phs_numblock fields once.
+ *
+ * Callers may optionally specify a non-InvalidBlockNumber value for
+ * 'startblock' to force the scan to start at the given page. Likewise,
+ * 'numblocks' may be set to a non-InvalidBlockNumber value to limit the
+ * scan to that many blocks.
*/
void
table_block_parallelscan_startblock_init(Relation rel,
ParallelBlockTableScanWorker pbscanwork,
- ParallelBlockTableScanDesc pbscan)
+ ParallelBlockTableScanDesc pbscan,
+ BlockNumber startblock,
+ BlockNumber numblocks)
{
BlockNumber sync_startpage = InvalidBlockNumber;
+ BlockNumber scan_nblocks;
/* Reset the state we use for controlling allocation size. */
memset(pbscanwork, 0, sizeof(*pbscanwork));
@@ -468,42 +471,36 @@ table_block_parallelscan_startblock_init(Relation rel,
StaticAssertStmt(MaxBlockNumber <= 0xFFFFFFFE,
"pg_nextpower2_32 may be too small for non-standard BlockNumber width");
- /*
- * We determine the chunk size based on the size of the relation. First we
- * split the relation into PARALLEL_SEQSCAN_NCHUNKS chunks but we then
- * take the next highest power of 2 number of the chunk size. This means
- * we split the relation into somewhere between PARALLEL_SEQSCAN_NCHUNKS
- * and PARALLEL_SEQSCAN_NCHUNKS / 2 chunks.
- */
- pbscanwork->phsw_chunk_size = pg_nextpower2_32(Max(pbscan->phs_nblocks /
- PARALLEL_SEQSCAN_NCHUNKS, 1));
-
- /*
- * Ensure we don't go over the maximum chunk size with larger tables. This
- * means we may get much more than PARALLEL_SEQSCAN_NCHUNKS for larger
- * tables. Too large a chunk size has been shown to be detrimental to
- * synchronous scan performance.
- */
- pbscanwork->phsw_chunk_size = Min(pbscanwork->phsw_chunk_size,
- PARALLEL_SEQSCAN_MAX_CHUNK_SIZE);
-
retry:
/* Grab the spinlock. */
SpinLockAcquire(&pbscan->phs_mutex);
/*
- * If the scan's startblock has not yet been initialized, we must do so
- * now. If this is not a synchronized scan, we just start at block 0, but
- * if it is a synchronized scan, we must get the starting position from
- * the synchronized scan machinery. We can't hold the spinlock while
- * doing that, though, so release the spinlock, get the information we
- * need, and retry. If nobody else has initialized the scan in the
- * meantime, we'll fill in the value we fetched on the second time
- * through.
+ * When the caller specified a limit on the number of blocks to scan, set
+ * that in the ParallelBlockTableScanDesc, if it's not been done by
+ * another worker already.
+ */
+ if (numblocks != InvalidBlockNumber &&
+ pbscan->phs_numblock == InvalidBlockNumber)
+ {
+ pbscan->phs_numblock = numblocks;
+ }
+
+ /*
+ * If the scan's phs_startblock has not yet been initialized, we must do
+ * so now. If a startblock was specified, start there, otherwise if this
+ * is not a synchronized scan, we just start at block 0, but if it is a
+ * synchronized scan, we must get the starting position from the
+ * synchronized scan machinery. We can't hold the spinlock while doing
+ * that, though, so release the spinlock, get the information we need, and
+ * retry. If nobody else has initialized the scan in the meantime, we'll
+ * fill in the value we fetched on the second time through.
*/
if (pbscan->phs_startblock == InvalidBlockNumber)
{
- if (!pbscan->base.phs_syncscan)
+ if (startblock != InvalidBlockNumber)
+ pbscan->phs_startblock = startblock;
+ else if (!pbscan->base.phs_syncscan)
pbscan->phs_startblock = 0;
else if (sync_startpage != InvalidBlockNumber)
pbscan->phs_startblock = sync_startpage;
@@ -515,6 +512,34 @@ retry:
}
}
SpinLockRelease(&pbscan->phs_mutex);
+
+ /*
+ * Figure out how many blocks we're going to scan; either all of them, or
+ * just phs_numblock's worth, if a limit has been imposed.
+ */
+ if (pbscan->phs_numblock == InvalidBlockNumber)
+ scan_nblocks = pbscan->phs_nblocks;
+ else
+ scan_nblocks = pbscan->phs_numblock;
+
+ /*
+ * We determine the chunk size based on scan_nblocks. First we split
+ * scan_nblocks into PARALLEL_SEQSCAN_NCHUNKS chunks, then round the
+ * result up to the next power of 2. This means we split the
+ * blocks we're scanning into somewhere between PARALLEL_SEQSCAN_NCHUNKS
+ * and PARALLEL_SEQSCAN_NCHUNKS / 2 chunks.
+ */
+ pbscanwork->phsw_chunk_size = pg_nextpower2_32(Max(scan_nblocks /
+ PARALLEL_SEQSCAN_NCHUNKS, 1));
+
+ /*
+ * Ensure we don't go over the maximum chunk size with larger tables. This
+ * means we may get much more than PARALLEL_SEQSCAN_NCHUNKS for larger
+ * tables. Too large a chunk size has been shown to be detrimental to
+ * synchronous scan performance.
+ */
+ pbscanwork->phsw_chunk_size = Min(pbscanwork->phsw_chunk_size,
+ PARALLEL_SEQSCAN_MAX_CHUNK_SIZE);
}
/*
@@ -530,6 +555,7 @@ table_block_parallelscan_nextpage(Relation rel,
ParallelBlockTableScanWorker pbscanwork,
ParallelBlockTableScanDesc pbscan)
{
+ BlockNumber scan_nblocks;
BlockNumber page;
uint64 nallocated;
@@ -550,7 +576,7 @@ table_block_parallelscan_nextpage(Relation rel,
*
* Here we name these ranges of blocks "chunks". The initial size of
* these chunks is determined in table_block_parallelscan_startblock_init
- * based on the size of the relation. Towards the end of the scan, we
+ * based on the number of blocks to scan. Towards the end of the scan, we
* start making reductions in the size of the chunks in order to attempt
* to divide the remaining work over all the workers as evenly as
* possible.
@@ -567,17 +593,23 @@ table_block_parallelscan_nextpage(Relation rel,
* phs_nallocated counter will exceed rs_nblocks, because workers will
* still increment the value, when they try to allocate the next block but
* all blocks have been allocated already. The counter must be 64 bits
- * wide because of that, to avoid wrapping around when rs_nblocks is close
- * to 2^32.
+ * wide because of that, to avoid wrapping around when scan_nblocks is
+ * close to 2^32.
*
* The actual block to return is calculated by adding the counter to the
- * starting block number, modulo nblocks.
+ * starting block number, modulo phs_nblocks.
*/
+ /* First, figure out how many blocks we're planning on scanning */
+ if (pbscan->phs_numblock == InvalidBlockNumber)
+ scan_nblocks = pbscan->phs_nblocks;
+ else
+ scan_nblocks = pbscan->phs_numblock;
+
/*
- * First check if we have any remaining blocks in a previous chunk for
- * this worker. We must consume all of the blocks from that before we
- * allocate a new chunk to the worker.
+ * Now check if we have any remaining blocks in a previous chunk for this
+ * worker. We must consume all of the blocks from that before we allocate
+ * a new chunk to the worker.
*/
if (pbscanwork->phsw_chunk_remaining > 0)
{
@@ -599,7 +631,7 @@ table_block_parallelscan_nextpage(Relation rel,
* chunk size set to 1.
*/
if (pbscanwork->phsw_chunk_size > 1 &&
- pbscanwork->phsw_nallocated > pbscan->phs_nblocks -
+ pbscanwork->phsw_nallocated > scan_nblocks -
(pbscanwork->phsw_chunk_size * PARALLEL_SEQSCAN_RAMPDOWN_CHUNKS))
pbscanwork->phsw_chunk_size >>= 1;
@@ -614,15 +646,9 @@ table_block_parallelscan_nextpage(Relation rel,
pbscanwork->phsw_chunk_remaining = pbscanwork->phsw_chunk_size - 1;
}
- /*
- * Check if we've allocated every block in the relation, or if we've
- * reached the limit imposed by pbscan->phs_numblock (if set).
- */
- if (nallocated >= pbscan->phs_nblocks)
- page = InvalidBlockNumber; /* all blocks have been allocated */
- else if (pbscan->phs_numblock != InvalidBlockNumber &&
- nallocated >= pbscan->phs_numblock)
- page = InvalidBlockNumber; /* upper scan limit reached */
+ /* Check if we've run out of blocks to scan */
+ if (nallocated >= scan_nblocks)
+ page = InvalidBlockNumber; /* all blocks have been allocated */
else
page = (nallocated + pbscan->phs_startblock) % pbscan->phs_nblocks;
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 7b1eb2e82c7..0125464d942 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -40,8 +40,8 @@
#include "executor/nodeSeqscan.h"
#include "executor/nodeSort.h"
#include "executor/nodeSubplan.h"
-#include "executor/tqueue.h"
#include "executor/nodeTidrangescan.h"
+#include "executor/tqueue.h"
#include "jit/jit.h"
#include "nodes/nodeFuncs.h"
#include "pgstat.h"
@@ -502,7 +502,7 @@ ExecParallelInitializeDSM(PlanState *planstate,
case T_TidRangeScanState:
if (planstate->plan->parallel_aware)
ExecTidRangeScanInitializeDSM((TidRangeScanState *) planstate,
- d->pcxt);
+ d->pcxt);
break;
case T_AppendState:
if (planstate->plan->parallel_aware)
@@ -1008,7 +1008,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
case T_TidRangeScanState:
if (planstate->plan->parallel_aware)
ExecTidRangeScanReInitializeDSM((TidRangeScanState *) planstate,
- pcxt);
+ pcxt);
break;
case T_AppendState:
if (planstate->plan->parallel_aware)
diff --git a/src/backend/executor/nodeTidrangescan.c b/src/backend/executor/nodeTidrangescan.c
index 39088755e90..03ce8525f89 100644
--- a/src/backend/executor/nodeTidrangescan.c
+++ b/src/backend/executor/nodeTidrangescan.c
@@ -250,13 +250,9 @@ TidRangeNext(TidRangeScanState *node)
}
else
{
- /* rescan with the updated TID range only in non-parallel mode */
- if (scandesc->rs_parallel == NULL)
- {
- /* rescan with the updated TID range */
- table_rescan_tidrange(scandesc, &node->trss_mintid,
- &node->trss_maxtid);
- }
+ /* rescan with the updated TID range */
+ table_rescan_tidrange(scandesc, &node->trss_mintid,
+ &node->trss_maxtid);
}
node->trss_inScan = true;
@@ -419,6 +415,7 @@ ExecInitTidRangeScan(TidRangeScan *node, EState *estate, int eflags)
*/
return tidrangestate;
}
+
/* ----------------------------------------------------------------
* Parallel Scan Support
* ----------------------------------------------------------------
@@ -460,19 +457,9 @@ ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt)
pscan,
estate->es_snapshot);
shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
-
- /*
- * Initialize parallel scan descriptor with given TID range if it can be
- * evaluated successfully.
- */
- if (TidRangeEval(node))
- node->ss.ss_currentScanDesc =
- table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan,
- &node->trss_mintid, &node->trss_maxtid);
- else
- node->ss.ss_currentScanDesc =
- table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan,
- NULL, NULL);
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation,
+ pscan);
}
/* ----------------------------------------------------------------
@@ -483,21 +470,12 @@ ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt)
*/
void
ExecTidRangeScanReInitializeDSM(TidRangeScanState *node,
- ParallelContext *pcxt)
+ ParallelContext *pcxt)
{
ParallelTableScanDesc pscan;
pscan = node->ss.ss_currentScanDesc->rs_parallel;
table_parallelscan_reinitialize(node->ss.ss_currentRelation, pscan);
-
- /* Set the new TID range if it can be evaluated successfully */
- if (TidRangeEval(node))
- node->ss.ss_currentRelation->rd_tableam->scan_set_tidrange(
- node->ss.ss_currentScanDesc, &node->trss_mintid,
- &node->trss_maxtid);
- else
- node->ss.ss_currentRelation->rd_tableam->scan_set_tidrange(
- node->ss.ss_currentScanDesc, NULL, NULL);
}
/* ----------------------------------------------------------------
@@ -508,18 +486,12 @@ ExecTidRangeScanReInitializeDSM(TidRangeScanState *node,
*/
void
ExecTidRangeScanInitializeWorker(TidRangeScanState *node,
- ParallelWorkerContext *pwcxt)
+ ParallelWorkerContext *pwcxt)
{
ParallelTableScanDesc pscan;
pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
-
- if (TidRangeEval(node))
- node->ss.ss_currentScanDesc =
- table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan,
- &node->trss_mintid, &node->trss_maxtid);
- else
- node->ss.ss_currentScanDesc =
- table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan,
- NULL, NULL);
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation,
+ pscan);
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 01976226d19..5a7283bd2f5 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1371,7 +1371,11 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
/*
* The first page in a range requires a random seek, but each subsequent
- * page is just a normal sequential page read.
+ * page is just a normal sequential page read. NOTE: it's desirable for
+ * TID Range Scans to cost more than the equivalent Sequential Scans,
+ * because Seq Scans have some performance advantages such as scan
+ * synchronization, and we'd prefer one of them to be picked unless a TID
+ * Range Scan really is better.
*/
ntuples = selectivity * baserel->tuples;
nseqpages = pages - 1.0;
diff --git a/src/backend/optimizer/path/tidpath.c b/src/backend/optimizer/path/tidpath.c
index e48c85833e7..3ddbc10bbdf 100644
--- a/src/backend/optimizer/path/tidpath.c
+++ b/src/backend/optimizer/path/tidpath.c
@@ -47,7 +47,6 @@
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/restrictinfo.h"
-#include "optimizer/cost.h"
/*
@@ -491,9 +490,8 @@ ec_member_matches_ctid(PlannerInfo *root, RelOptInfo *rel,
/*
* create_tidscan_paths
- * Create paths corresponding to direct TID scans of the given rel.
- *
- * Candidate paths are added to the rel's pathlist (using add_path).
+ * Create paths corresponding to direct TID scans of the given rel and add
+ * them to its pathlist via add_path or add_partial_path.
*/
bool
create_tidscan_paths(PlannerInfo *root, RelOptInfo *rel)
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8e97fc5f0be..5ef8de3f141 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1138,9 +1138,7 @@ extern TableScanDesc table_beginscan_parallel(Relation relation,
* Caller must hold a suitable lock on the relation.
*/
extern TableScanDesc table_beginscan_parallel_tidrange(Relation relation,
- ParallelTableScanDesc pscan,
- ItemPointerData * mintid,
- ItemPointerData * maxtid);
+ ParallelTableScanDesc pscan);
/*
* Restart a parallel scan. Call this in the leader process. Caller is
@@ -2040,7 +2038,9 @@ extern BlockNumber table_block_parallelscan_nextpage(Relation rel,
ParallelBlockTableScanDesc pbscan);
extern void table_block_parallelscan_startblock_init(Relation rel,
ParallelBlockTableScanWorker pbscanwork,
- ParallelBlockTableScanDesc pbscan);
+ ParallelBlockTableScanDesc pbscan,
+ BlockNumber startblock,
+ BlockNumber numblocks);
/* ----------------------------------------------------------------------------
diff --git a/src/test/regress/expected/tidrangescan.out b/src/test/regress/expected/tidrangescan.out
index 3c5fc9e102a..ce75c96e7c8 100644
--- a/src/test/regress/expected/tidrangescan.out
+++ b/src/test/regress/expected/tidrangescan.out
@@ -297,22 +297,23 @@ FETCH LAST c;
COMMIT;
DROP TABLE tidrangescan;
--- tests for parallel tidrangescans
-SET parallel_setup_cost TO 0;
-SET parallel_tuple_cost TO 0;
-SET min_parallel_table_scan_size TO 0;
-SET max_parallel_workers_per_gather TO 4;
+-- Tests for parallel tidrangescans
+BEGIN;
+SET LOCAL parallel_setup_cost TO 0;
+SET LOCAL parallel_tuple_cost TO 0;
+SET LOCAL min_parallel_table_scan_size TO 0;
+SET LOCAL max_parallel_workers_per_gather TO 4;
CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor = 10);
--- insert enough tuples such that each page gets 5 tuples with fillfactor = 10
+-- Insert enough tuples such that each page gets 5 tuples with fillfactor = 10
INSERT INTO parallel_tidrangescan SELECT i, repeat('x', 100) FROM generate_series(1,200) AS s(i);
--- ensure there are 40 pages for parallel test
+-- Ensure there are 40 pages for parallel test
SELECT min(ctid), max(ctid) FROM parallel_tidrangescan;
min | max
-------+--------
(0,1) | (39,5)
(1 row)
--- parallel range scans with upper bound
+-- Parallel range scans with upper bound
EXPLAIN (COSTS OFF)
SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
QUERY PLAN
@@ -331,7 +332,7 @@ SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
150
(1 row)
--- parallel range scans with lower bound
+-- Parallel range scans with lower bound
EXPLAIN (COSTS OFF)
SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
QUERY PLAN
@@ -350,7 +351,7 @@ SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
150
(1 row)
--- parallel range scans with both bounds
+-- Parallel range scans with both bounds
EXPLAIN (COSTS OFF)
SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
QUERY PLAN
@@ -369,7 +370,7 @@ SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30
100
(1 row)
--- parallel rescans
+-- Parallel rescans
EXPLAIN (COSTS OFF)
SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
@@ -398,9 +399,5 @@ WHERE t.ctid < '(1,0)';
(0,5) | 5
(5 rows)
-DROP TABLE parallel_tidrangescan;
-RESET parallel_setup_cost;
-RESET parallel_tuple_cost;
-RESET min_parallel_table_scan_size;
-RESET max_parallel_workers_per_gather;
+ROLLBACK;
RESET enable_seqscan;
diff --git a/src/test/regress/sql/tidrangescan.sql b/src/test/regress/sql/tidrangescan.sql
index 0f1e43c6d05..c9a63b10ddd 100644
--- a/src/test/regress/sql/tidrangescan.sql
+++ b/src/test/regress/sql/tidrangescan.sql
@@ -98,36 +98,38 @@ COMMIT;
DROP TABLE tidrangescan;
--- tests for parallel tidrangescans
-SET parallel_setup_cost TO 0;
-SET parallel_tuple_cost TO 0;
-SET min_parallel_table_scan_size TO 0;
-SET max_parallel_workers_per_gather TO 4;
+-- Tests for parallel tidrangescans
+BEGIN;
+
+SET LOCAL parallel_setup_cost TO 0;
+SET LOCAL parallel_tuple_cost TO 0;
+SET LOCAL min_parallel_table_scan_size TO 0;
+SET LOCAL max_parallel_workers_per_gather TO 4;
CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor = 10);
--- insert enough tuples such that each page gets 5 tuples with fillfactor = 10
+-- Insert enough tuples such that each page gets 5 tuples with fillfactor = 10
INSERT INTO parallel_tidrangescan SELECT i, repeat('x', 100) FROM generate_series(1,200) AS s(i);
--- ensure there are 40 pages for parallel test
+-- Ensure there are 40 pages for parallel test
SELECT min(ctid), max(ctid) FROM parallel_tidrangescan;
--- parallel range scans with upper bound
+-- Parallel range scans with upper bound
EXPLAIN (COSTS OFF)
SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
--- parallel range scans with lower bound
+-- Parallel range scans with lower bound
EXPLAIN (COSTS OFF)
SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
--- parallel range scans with both bounds
+-- Parallel range scans with both bounds
EXPLAIN (COSTS OFF)
SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
--- parallel rescans
+-- Parallel rescans
EXPLAIN (COSTS OFF)
SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
@@ -137,10 +139,5 @@ SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
WHERE t.ctid < '(1,0)';
-DROP TABLE parallel_tidrangescan;
-
-RESET parallel_setup_cost;
-RESET parallel_tuple_cost;
-RESET min_parallel_table_scan_size;
-RESET max_parallel_workers_per_gather;
+ROLLBACK;
RESET enable_seqscan;
--
2.43.0
On Fri, 7 Nov 2025 at 18:31, David Rowley <dgrowleyml@gmail.com> wrote:
I still need to do a bit more testing on this, but in the meantime
thought I'd share what I've done with it so that other people can look
in parallel.
I've been looking at the v11 patch again. I did some testing and can't break it.
I noted down a couple of things:
1. table_parallelscan_initialize() is called first in a parallel TID
Range Scan which calls table_block_parallelscan_initialize() and may
set phs_syncscan to true. We directly then call
table_beginscan_parallel_tidrange(), which sets phs_syncscan = false
unconditionally. No bugs, but it is a little strange. One way to get
around this weirdness would be to move the responsibility of setting
phs_syncscan into table_parallelscan_initialize(), so that
table_beginscan_parallel_tidrange() would not need to set phs_syncscan = false itself. I
wasn't overly concerned about this, so I didn't do that. I just wanted
to mention it here as someone else might think it's worth making
better.
2. I've made it so each worker calls TidRangeEval() to figure out the
TID range to scan. The first worker to get the lock in
table_block_parallelscan_startblock_init() gets to set the range of
blocks to scan for all workers. In the planner, the suitability of the
TID Range quals are checked with IsBinaryTidClause(), which allows
OpExprs with the ctid column compared to a Var-less expression that
contains no volatile functions. If someone coded up a parallel safe
volatile function and marked it as stable, each worker could end up
with different ctid values and one worker would win the race to set
the blocks to scan based on the TID values it got, which wouldn't
really suit the other workers and the tid values they ended up with.
I'm thinking that if someone marks a volatile function as stable, then
we're entitled to have things behave strangely for them. To make
this right, I'd have to make it so only 1 worker evaluates the TID
expressions and then sets those somehow for the other workers. There'd
have to be some additional shared memory for that, and I don't think
the complexity of making that work is worthwhile. Mentioning it as
someone else might feel differently.
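To illustrate #2, here is a hypothetical example (the table and function names are invented, not from the patch) of the kind of mislabelled function that could give each worker a different TID bound:

```sql
-- Hypothetical: a function that is really volatile but is (wrongly)
-- declared STABLE and PARALLEL SAFE, so the planner accepts it in a
-- TID range qual and each parallel worker evaluates it independently.
CREATE FUNCTION sometimes_different_tid() RETURNS tid AS $$
  SELECT format('(%s,1)', (random() * 100)::int)::tid
$$ LANGUAGE sql STABLE PARALLEL SAFE;

-- Each worker may compute a different upper bound here; the first worker
-- to reach table_block_parallelscan_startblock_init() fixes the block
-- range for everyone, which may not match the bounds the other workers
-- computed.
SELECT count(*) FROM some_table WHERE ctid < sometimes_different_tid();
```

Since the function lied about its volatility, any resulting inconsistency seems acceptable rather than something worth extra shared-memory machinery to prevent.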
I've attached v12, which adds a mention in the docs about Parallel TID
Range scans being supported. It also makes very minor adjustments to
the comments. Again, I've kept Cary's v10 and the changes I've made
separate. Of course, I'd squash these before commit.
Does anyone have any opinions on #1 or #2 or want to look at this? I'd
like to get this in soon.
David
Attachments:
v12-0001-v10-parallel-tid-range-scan.patch (application/octet-stream)
From 38d6e8a1661fd81b2ddbcb867c724f4bad7dcd09 Mon Sep 17 00:00:00 2001
From: Cary Huang <cary.huang@highgo.ca>
Date: Fri, 29 Aug 2025 10:10:51 -0700
Subject: [PATCH v12 1/2] v10 parallel tid range scan
---
src/backend/access/heap/heapam.c | 10 ++
src/backend/access/table/tableam.c | 46 ++++++++-
src/backend/executor/execParallel.c | 21 ++++
src/backend/executor/nodeTidrangescan.c | 114 ++++++++++++++++++++-
src/backend/optimizer/path/costsize.c | 36 ++++---
src/backend/optimizer/path/tidpath.c | 20 +++-
src/backend/optimizer/util/pathnode.c | 7 +-
src/include/access/relscan.h | 2 +
src/include/access/tableam.h | 12 +++
src/include/executor/nodeTidrangescan.h | 7 ++
src/include/nodes/execnodes.h | 2 +
src/include/optimizer/pathnode.h | 3 +-
src/test/regress/expected/tidrangescan.out | 106 +++++++++++++++++++
src/test/regress/sql/tidrangescan.sql | 45 ++++++++
14 files changed, 410 insertions(+), 21 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 4b0c49f4bb0..de0a3a8b219 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -490,6 +490,16 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
scan->rs_startblock = startBlk;
scan->rs_numblocks = numBlks;
+
+ /* set the limits in the ParallelBlockTableScanDesc, when present as leader */
+ if (scan->rs_base.rs_parallel != NULL && !IsParallelWorker())
+ {
+ ParallelBlockTableScanDesc bpscan;
+
+ bpscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
+ bpscan->phs_startblock = startBlk;
+ bpscan->phs_numblock = numBlks;
+ }
}
/*
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 5e41404937e..baef7459b6b 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -188,6 +188,42 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
pscan, flags);
}
+TableScanDesc
+table_beginscan_parallel_tidrange(Relation relation, ParallelTableScanDesc pscan,
+ ItemPointerData * mintid, ItemPointerData * maxtid)
+{
+ Snapshot snapshot;
+ uint32 flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
+ TableScanDesc sscan;
+
+ Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
+
+ /* disable syncscan in parallel tid range scan. */
+ pscan->phs_syncscan = false;
+
+ if (!pscan->phs_snapshot_any)
+ {
+ /* Snapshot was serialized -- restore it */
+ snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
+ RegisterSnapshot(snapshot);
+ flags |= SO_TEMP_SNAPSHOT;
+ }
+ else
+ {
+ /* SnapshotAny passed by caller (not serialized) */
+ snapshot = SnapshotAny;
+ }
+
+ sscan = relation->rd_tableam->scan_begin(relation, snapshot, 0, NULL,
+ pscan, flags);
+
+ /* Set the TID range if needed */
+ if (mintid && maxtid)
+ relation->rd_tableam->scan_set_tidrange(sscan, mintid, maxtid);
+
+ return sscan;
+}
+
/* ----------------------------------------------------------------------------
* Index scan related functions.
@@ -398,6 +434,7 @@ table_block_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan)
bpscan->phs_nblocks > NBuffers / 4;
SpinLockInit(&bpscan->phs_mutex);
bpscan->phs_startblock = InvalidBlockNumber;
+ bpscan->phs_numblock = InvalidBlockNumber;
pg_atomic_init_u64(&bpscan->phs_nallocated, 0);
return sizeof(ParallelBlockTableScanDescData);
@@ -577,8 +614,15 @@ table_block_parallelscan_nextpage(Relation rel,
pbscanwork->phsw_chunk_remaining = pbscanwork->phsw_chunk_size - 1;
}
+ /*
+ * Check if we've allocated every block in the relation, or if we've
+ * reached the limit imposed by pbscan->phs_numblock (if set).
+ */
if (nallocated >= pbscan->phs_nblocks)
- page = InvalidBlockNumber; /* all blocks have been allocated */
+ page = InvalidBlockNumber; /* all blocks have been allocated */
+ else if (pbscan->phs_numblock != InvalidBlockNumber &&
+ nallocated >= pbscan->phs_numblock)
+ page = InvalidBlockNumber; /* upper scan limit reached */
else
page = (nallocated + pbscan->phs_startblock) % pbscan->phs_nblocks;
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f098a5557cf..7b1eb2e82c7 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -41,6 +41,7 @@
#include "executor/nodeSort.h"
#include "executor/nodeSubplan.h"
#include "executor/tqueue.h"
+#include "executor/nodeTidrangescan.h"
#include "jit/jit.h"
#include "nodes/nodeFuncs.h"
#include "pgstat.h"
@@ -266,6 +267,11 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
ExecForeignScanEstimate((ForeignScanState *) planstate,
e->pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanEstimate((TidRangeScanState *) planstate,
+ e->pcxt);
+ break;
case T_AppendState:
if (planstate->plan->parallel_aware)
ExecAppendEstimate((AppendState *) planstate,
@@ -493,6 +499,11 @@ ExecParallelInitializeDSM(PlanState *planstate,
ExecForeignScanInitializeDSM((ForeignScanState *) planstate,
d->pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanInitializeDSM((TidRangeScanState *) planstate,
+ d->pcxt);
+ break;
case T_AppendState:
if (planstate->plan->parallel_aware)
ExecAppendInitializeDSM((AppendState *) planstate,
@@ -994,6 +1005,11 @@ ExecParallelReInitializeDSM(PlanState *planstate,
ExecForeignScanReInitializeDSM((ForeignScanState *) planstate,
pcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanReInitializeDSM((TidRangeScanState *) planstate,
+ pcxt);
+ break;
case T_AppendState:
if (planstate->plan->parallel_aware)
ExecAppendReInitializeDSM((AppendState *) planstate, pcxt);
@@ -1362,6 +1378,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
ExecForeignScanInitializeWorker((ForeignScanState *) planstate,
pwcxt);
break;
+ case T_TidRangeScanState:
+ if (planstate->plan->parallel_aware)
+ ExecTidRangeScanInitializeWorker((TidRangeScanState *) planstate,
+ pwcxt);
+ break;
case T_AppendState:
if (planstate->plan->parallel_aware)
ExecAppendInitializeWorker((AppendState *) planstate, pwcxt);
diff --git a/src/backend/executor/nodeTidrangescan.c b/src/backend/executor/nodeTidrangescan.c
index 1bce8d6cbfe..39088755e90 100644
--- a/src/backend/executor/nodeTidrangescan.c
+++ b/src/backend/executor/nodeTidrangescan.c
@@ -250,9 +250,13 @@ TidRangeNext(TidRangeScanState *node)
}
else
{
- /* rescan with the updated TID range */
- table_rescan_tidrange(scandesc, &node->trss_mintid,
- &node->trss_maxtid);
+ /* rescan with the updated TID range only in non-parallel mode */
+ if (scandesc->rs_parallel == NULL)
+ {
+ /* rescan with the updated TID range */
+ table_rescan_tidrange(scandesc, &node->trss_mintid,
+ &node->trss_maxtid);
+ }
}
node->trss_inScan = true;
@@ -415,3 +419,107 @@ ExecInitTidRangeScan(TidRangeScan *node, EState *estate, int eflags)
*/
return tidrangestate;
}
+/* ----------------------------------------------------------------
+ * Parallel Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanEstimate
+ *
+ * Compute the amount of space we'll need in the parallel
+ * query DSM, and inform pcxt->estimator about our needs.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanEstimate(TidRangeScanState *node, ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+
+ node->trss_pscanlen =
+ table_parallelscan_estimate(node->ss.ss_currentRelation,
+ estate->es_snapshot);
+ shm_toc_estimate_chunk(&pcxt->estimator, node->trss_pscanlen);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanInitializeDSM
+ *
+ * Set up a parallel TID scan descriptor.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+ ParallelTableScanDesc pscan;
+
+ pscan = shm_toc_allocate(pcxt->toc, node->trss_pscanlen);
+ table_parallelscan_initialize(node->ss.ss_currentRelation,
+ pscan,
+ estate->es_snapshot);
+ shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
+
+ /*
+ * Initialize parallel scan descriptor with given TID range if it can be
+ * evaluated successfully.
+ */
+ if (TidRangeEval(node))
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan,
+ &node->trss_mintid, &node->trss_maxtid);
+ else
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan,
+ NULL, NULL);
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanReInitializeDSM
+ *
+ * Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanReInitializeDSM(TidRangeScanState *node,
+ ParallelContext *pcxt)
+{
+ ParallelTableScanDesc pscan;
+
+ pscan = node->ss.ss_currentScanDesc->rs_parallel;
+ table_parallelscan_reinitialize(node->ss.ss_currentRelation, pscan);
+
+ /* Set the new TID range if it can be evaluated successfully */
+ if (TidRangeEval(node))
+ node->ss.ss_currentRelation->rd_tableam->scan_set_tidrange(
+ node->ss.ss_currentScanDesc, &node->trss_mintid,
+ &node->trss_maxtid);
+ else
+ node->ss.ss_currentRelation->rd_tableam->scan_set_tidrange(
+ node->ss.ss_currentScanDesc, NULL, NULL);
+}
+
+/* ----------------------------------------------------------------
+ * ExecTidRangeScanInitializeWorker
+ *
+ * Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeWorker(TidRangeScanState *node,
+ ParallelWorkerContext *pwcxt)
+{
+ ParallelTableScanDesc pscan;
+
+ pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
+
+ if (TidRangeEval(node))
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan,
+ &node->trss_mintid, &node->trss_maxtid);
+ else
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan,
+ NULL, NULL);
+}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8335cf5b5c5..01976226d19 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1340,8 +1340,9 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
{
Selectivity selectivity;
double pages;
- Cost startup_cost = 0;
- Cost run_cost = 0;
+ Cost startup_cost;
+ Cost cpu_run_cost;
+ Cost disk_run_cost;
QualCost qpqual_cost;
Cost cpu_per_tuple;
QualCost tid_qual_cost;
@@ -1370,11 +1371,7 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
/*
* The first page in a range requires a random seek, but each subsequent
- * page is just a normal sequential page read. NOTE: it's desirable for
- * TID Range Scans to cost more than the equivalent Sequential Scans,
- * because Seq Scans have some performance advantages such as scan
- * synchronization and parallelizability, and we'd prefer one of them to
- * be picked unless a TID Range Scan really is better.
+ * page is just a normal sequential page read.
*/
ntuples = selectivity * baserel->tuples;
nseqpages = pages - 1.0;
@@ -1391,7 +1388,7 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
&spc_seq_page_cost);
/* disk costs; 1 random page and the remainder as seq pages */
- run_cost += spc_random_page_cost + spc_seq_page_cost * nseqpages;
+ disk_run_cost = spc_random_page_cost + spc_seq_page_cost * nseqpages;
/* Add scanning CPU costs */
get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
@@ -1403,20 +1400,35 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
* can't be removed, this is a mistake and we're going to underestimate
* the CPU cost a bit.)
*/
- startup_cost += qpqual_cost.startup + tid_qual_cost.per_tuple;
+ startup_cost = qpqual_cost.startup + tid_qual_cost.per_tuple;
cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple -
tid_qual_cost.per_tuple;
- run_cost += cpu_per_tuple * ntuples;
+ cpu_run_cost = cpu_per_tuple * ntuples;
/* tlist eval costs are paid per output row, not per tuple scanned */
startup_cost += path->pathtarget->cost.startup;
- run_cost += path->pathtarget->cost.per_tuple * path->rows;
+ cpu_run_cost += path->pathtarget->cost.per_tuple * path->rows;
+
+ /* Adjust costing for parallelism, if used. */
+ if (path->parallel_workers > 0)
+ {
+ double parallel_divisor = get_parallel_divisor(path);
+
+ /* The CPU cost is divided among all the workers. */
+ cpu_run_cost /= parallel_divisor;
+
+ /*
+ * In the case of a parallel plan, the row count needs to represent
+ * the number of tuples processed per worker.
+ */
+ path->rows = clamp_row_est(path->rows / parallel_divisor);
+ }
/* we should not generate this path type when enable_tidscan=false */
Assert(enable_tidscan);
path->disabled_nodes = 0;
path->startup_cost = startup_cost;
- path->total_cost = startup_cost + run_cost;
+ path->total_cost = startup_cost + cpu_run_cost + disk_run_cost;
}
/*
diff --git a/src/backend/optimizer/path/tidpath.c b/src/backend/optimizer/path/tidpath.c
index 2bfb338b81c..e48c85833e7 100644
--- a/src/backend/optimizer/path/tidpath.c
+++ b/src/backend/optimizer/path/tidpath.c
@@ -47,6 +47,7 @@
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/restrictinfo.h"
+#include "optimizer/cost.h"
/*
@@ -553,7 +554,24 @@ create_tidscan_paths(PlannerInfo *root, RelOptInfo *rel)
add_path(rel, (Path *) create_tidrangescan_path(root, rel,
tidrangequals,
- required_outer));
+ required_outer,
+ 0));
+
+ /* If appropriate, consider parallel tid range scan. */
+ if (rel->consider_parallel && required_outer == NULL)
+ {
+ int parallel_workers;
+
+ parallel_workers = compute_parallel_worker(rel, rel->pages, -1,
+ max_parallel_workers_per_gather);
+
+ if (parallel_workers > 0)
+ add_partial_path(rel, (Path *) create_tidrangescan_path(root,
+ rel,
+ tidrangequals,
+ required_outer,
+ parallel_workers));
+ }
}
/*
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index e4fd6950fad..fd4bd5f93f0 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1262,7 +1262,8 @@ create_tidscan_path(PlannerInfo *root, RelOptInfo *rel, List *tidquals,
*/
TidRangePath *
create_tidrangescan_path(PlannerInfo *root, RelOptInfo *rel,
- List *tidrangequals, Relids required_outer)
+ List *tidrangequals, Relids required_outer,
+ int parallel_workers)
{
TidRangePath *pathnode = makeNode(TidRangePath);
@@ -1271,9 +1272,9 @@ create_tidrangescan_path(PlannerInfo *root, RelOptInfo *rel,
pathnode->path.pathtarget = rel->reltarget;
pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
required_outer);
- pathnode->path.parallel_aware = false;
+ pathnode->path.parallel_aware = (parallel_workers > 0);
pathnode->path.parallel_safe = rel->consider_parallel;
- pathnode->path.parallel_workers = 0;
+ pathnode->path.parallel_workers = parallel_workers;
pathnode->path.pathkeys = NIL; /* always unordered */
pathnode->tidrangequals = tidrangequals;
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b5e0fb386c0..3da43557a13 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -96,6 +96,8 @@ typedef struct ParallelBlockTableScanDescData
BlockNumber phs_nblocks; /* # blocks in relation at start of scan */
slock_t phs_mutex; /* mutual exclusion for setting startblock */
BlockNumber phs_startblock; /* starting block number */
+ BlockNumber phs_numblock; /* # blocks to scan, or InvalidBlockNumber if
+ * no limit */
pg_atomic_uint64 phs_nallocated; /* number of blocks allocated to
* workers so far. */
} ParallelBlockTableScanDescData;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index e16bf025692..8e97fc5f0be 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1130,6 +1130,18 @@ extern void table_parallelscan_initialize(Relation rel,
extern TableScanDesc table_beginscan_parallel(Relation relation,
ParallelTableScanDesc pscan);
+/*
+ * Begin a parallel tidrange scan. `pscan` needs to have been initialized with
+ * table_parallelscan_initialize(), for the same relation. The initialization
+ * does not need to have happened in this backend.
+ *
+ * Caller must hold a suitable lock on the relation.
+ */
+extern TableScanDesc table_beginscan_parallel_tidrange(Relation relation,
+ ParallelTableScanDesc pscan,
+ ItemPointerData * mintid,
+ ItemPointerData * maxtid);
+
/*
* Restart a parallel scan. Call this in the leader process. Caller is
* responsible for making sure that all workers have finished the scan
diff --git a/src/include/executor/nodeTidrangescan.h b/src/include/executor/nodeTidrangescan.h
index a831f1202ca..2b5465b3ce4 100644
--- a/src/include/executor/nodeTidrangescan.h
+++ b/src/include/executor/nodeTidrangescan.h
@@ -14,6 +14,7 @@
#ifndef NODETIDRANGESCAN_H
#define NODETIDRANGESCAN_H
+#include "access/parallel.h"
#include "nodes/execnodes.h"
extern TidRangeScanState *ExecInitTidRangeScan(TidRangeScan *node,
@@ -21,4 +22,10 @@ extern TidRangeScanState *ExecInitTidRangeScan(TidRangeScan *node,
extern void ExecEndTidRangeScan(TidRangeScanState *node);
extern void ExecReScanTidRangeScan(TidRangeScanState *node);
+/* parallel scan support */
+extern void ExecTidRangeScanEstimate(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanReInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt);
+extern void ExecTidRangeScanInitializeWorker(TidRangeScanState *node, ParallelWorkerContext *pwcxt);
+
#endif /* NODETIDRANGESCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 18ae8f0d4bb..64ff6996431 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1930,6 +1930,7 @@ typedef struct TidScanState
* trss_mintid the lowest TID in the scan range
* trss_maxtid the highest TID in the scan range
* trss_inScan is a scan currently in progress?
+ * trss_pscanlen size of parallel heap scan descriptor
* ----------------
*/
typedef struct TidRangeScanState
@@ -1939,6 +1940,7 @@ typedef struct TidRangeScanState
ItemPointerData trss_mintid;
ItemPointerData trss_maxtid;
bool trss_inScan;
+ Size trss_pscanlen;
} TidRangeScanState;
/* ----------------
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 955e9056858..6b010f0b1a5 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -67,7 +67,8 @@ extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
extern TidRangePath *create_tidrangescan_path(PlannerInfo *root,
RelOptInfo *rel,
List *tidrangequals,
- Relids required_outer);
+ Relids required_outer,
+ int parallel_workers);
extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
List *subpaths, List *partial_subpaths,
List *pathkeys, Relids required_outer,
diff --git a/src/test/regress/expected/tidrangescan.out b/src/test/regress/expected/tidrangescan.out
index 721f3b94e04..3c5fc9e102a 100644
--- a/src/test/regress/expected/tidrangescan.out
+++ b/src/test/regress/expected/tidrangescan.out
@@ -297,4 +297,110 @@ FETCH LAST c;
COMMIT;
DROP TABLE tidrangescan;
+-- tests for parallel tidrangescans
+SET parallel_setup_cost TO 0;
+SET parallel_tuple_cost TO 0;
+SET min_parallel_table_scan_size TO 0;
+SET max_parallel_workers_per_gather TO 4;
+CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor = 10);
+-- insert enough tuples such that each page gets 5 tuples with fillfactor = 10
+INSERT INTO parallel_tidrangescan SELECT i, repeat('x', 100) FROM generate_series(1,200) AS s(i);
+-- ensure there are 40 pages for parallel test
+SELECT min(ctid), max(ctid) FROM parallel_tidrangescan;
+ min | max
+-------+--------
+ (0,1) | (39,5)
+(1 row)
+
+-- parallel range scans with upper bound
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
+ QUERY PLAN
+--------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: (ctid < '(30,1)'::tid)
+(6 rows)
+
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
+ count
+-------
+ 150
+(1 row)
+
+-- parallel range scans with lower bound
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
+ QUERY PLAN
+--------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: (ctid > '(10,0)'::tid)
+(6 rows)
+
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
+ count
+-------
+ 150
+(1 row)
+
+-- parallel range scans with both bounds
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
+ QUERY PLAN
+-----------------------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 4
+ -> Partial Aggregate
+ -> Parallel Tid Range Scan on parallel_tidrangescan
+ TID Cond: ((ctid > '(10,0)'::tid) AND (ctid < '(30,1)'::tid))
+(6 rows)
+
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
+ count
+-------
+ 100
+(1 row)
+
+-- parallel rescans
+EXPLAIN (COSTS OFF)
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+ QUERY PLAN
+----------------------------------------------------------------
+ Nested Loop
+ -> Gather
+ Workers Planned: 4
+ -> Parallel Tid Range Scan on parallel_tidrangescan t
+ TID Cond: (ctid < '(1,0)'::tid)
+ -> Aggregate
+ -> Tid Range Scan on parallel_tidrangescan t2
+ TID Cond: (ctid <= t.ctid)
+(8 rows)
+
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+ ctid | c
+-------+---
+ (0,1) | 1
+ (0,2) | 2
+ (0,3) | 3
+ (0,4) | 4
+ (0,5) | 5
+(5 rows)
+
+DROP TABLE parallel_tidrangescan;
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
RESET enable_seqscan;
diff --git a/src/test/regress/sql/tidrangescan.sql b/src/test/regress/sql/tidrangescan.sql
index ac09ebb6262..0f1e43c6d05 100644
--- a/src/test/regress/sql/tidrangescan.sql
+++ b/src/test/regress/sql/tidrangescan.sql
@@ -98,4 +98,49 @@ COMMIT;
DROP TABLE tidrangescan;
+-- tests for parallel tidrangescans
+SET parallel_setup_cost TO 0;
+SET parallel_tuple_cost TO 0;
+SET min_parallel_table_scan_size TO 0;
+SET max_parallel_workers_per_gather TO 4;
+
+CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor = 10);
+
+-- insert enough tuples such that each page gets 5 tuples with fillfactor = 10
+INSERT INTO parallel_tidrangescan SELECT i, repeat('x', 100) FROM generate_series(1,200) AS s(i);
+
+-- ensure there are 40 pages for parallel test
+SELECT min(ctid), max(ctid) FROM parallel_tidrangescan;
+
+-- parallel range scans with upper bound
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
+
+-- parallel range scans with lower bound
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
+
+-- parallel range scans with both bounds
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
+SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
+
+-- parallel rescans
+EXPLAIN (COSTS OFF)
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+
+SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
+LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
+WHERE t.ctid < '(1,0)';
+
+DROP TABLE parallel_tidrangescan;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
RESET enable_seqscan;
--
2.43.0
Attachments:
v12-0002-fixup-v10-parallel-tid-range-scan.patch (application/octet-stream)
From bc5acb45fd9c053b756d7a0a3e86418e5851f737 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Fri, 7 Nov 2025 18:03:09 +1300
Subject: [PATCH v12 2/2] fixup! v10 parallel tid range scan
---
doc/src/sgml/parallel.sgml | 9 ++
src/backend/access/heap/heapam.c | 14 +--
src/backend/access/table/tableam.c | 134 ++++++++++++---------
src/backend/executor/execParallel.c | 6 +-
src/backend/executor/nodeTidrangescan.c | 54 ++-------
src/backend/optimizer/path/costsize.c | 6 +-
src/backend/optimizer/path/tidpath.c | 6 +-
src/include/access/relscan.h | 2 +-
src/include/access/tableam.h | 14 +--
src/test/regress/expected/tidrangescan.out | 29 ++---
src/test/regress/sql/tidrangescan.sql | 31 +++--
11 files changed, 150 insertions(+), 155 deletions(-)
diff --git a/doc/src/sgml/parallel.sgml b/doc/src/sgml/parallel.sgml
index 1ce9abf86f5..af43484703e 100644
--- a/doc/src/sgml/parallel.sgml
+++ b/doc/src/sgml/parallel.sgml
@@ -299,6 +299,15 @@ EXPLAIN SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';
within each worker process.
</para>
</listitem>
+ <listitem>
+ <para>
+ In a <emphasis>parallel tid range scan</emphasis>, the range of blocks
+ will be subdivided into smaller ranges which are shared among the
+ cooperating processes. Each worker process will complete the scanning
+ of its given range of blocks before requesting an additional range of
+ blocks.
+ </para>
+ </listitem>
</itemizedlist>
Other scan types, such as scans of non-btree indexes, may support
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index de0a3a8b219..0a820bab87a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -258,7 +258,9 @@ heap_scan_stream_read_next_parallel(ReadStream *stream,
/* parallel scan */
table_block_parallelscan_startblock_init(scan->rs_base.rs_rd,
scan->rs_parallelworkerdata,
- (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel);
+ (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel,
+ scan->rs_startblock,
+ scan->rs_numblocks);
/* may return InvalidBlockNumber if there are no more blocks */
scan->rs_prefetch_block = table_block_parallelscan_nextpage(scan->rs_base.rs_rd,
@@ -490,16 +492,6 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
scan->rs_startblock = startBlk;
scan->rs_numblocks = numBlks;
-
- /* set the limits in the ParallelBlockTableScanDesc, when present as leader */
- if (scan->rs_base.rs_parallel != NULL && !IsParallelWorker())
- {
- ParallelBlockTableScanDesc bpscan;
-
- bpscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
- bpscan->phs_startblock = startBlk;
- bpscan->phs_numblock = numBlks;
- }
}
/*
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index baef7459b6b..9c3347ba12b 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -189,8 +189,8 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
}
TableScanDesc
-table_beginscan_parallel_tidrange(Relation relation, ParallelTableScanDesc pscan,
- ItemPointerData * mintid, ItemPointerData * maxtid)
+table_beginscan_parallel_tidrange(Relation relation,
+ ParallelTableScanDesc pscan)
{
Snapshot snapshot;
uint32 flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
@@ -216,11 +216,6 @@ table_beginscan_parallel_tidrange(Relation relation, ParallelTableScanDesc pscan
sscan = relation->rd_tableam->scan_begin(relation, snapshot, 0, NULL,
pscan, flags);
-
- /* Set the TID range if needed */
- if (mintid && maxtid)
- relation->rd_tableam->scan_set_tidrange(sscan, mintid, maxtid);
-
return sscan;
}
@@ -453,14 +448,22 @@ table_block_parallelscan_reinitialize(Relation rel, ParallelTableScanDesc pscan)
*
* Determine where the parallel seq scan should start. This function may be
* called many times, once by each parallel worker. We must be careful only
- * to set the startblock once.
+ * to set the phs_startblock and phs_numblock fields once.
+ *
+ * Callers may optionally specify a non-InvalidBlockNumber value for
+ * 'startblock' to force the scan to start at the given page. Likewise,
+ * 'numblocks' can be specified as a non-InvalidBlockNumber to limit the
+ * number of blocks to scan to that many blocks.
*/
void
table_block_parallelscan_startblock_init(Relation rel,
ParallelBlockTableScanWorker pbscanwork,
- ParallelBlockTableScanDesc pbscan)
+ ParallelBlockTableScanDesc pbscan,
+ BlockNumber startblock,
+ BlockNumber numblocks)
{
BlockNumber sync_startpage = InvalidBlockNumber;
+ BlockNumber scan_nblocks;
/* Reset the state we use for controlling allocation size. */
memset(pbscanwork, 0, sizeof(*pbscanwork));
@@ -468,42 +471,36 @@ table_block_parallelscan_startblock_init(Relation rel,
StaticAssertStmt(MaxBlockNumber <= 0xFFFFFFFE,
"pg_nextpower2_32 may be too small for non-standard BlockNumber width");
- /*
- * We determine the chunk size based on the size of the relation. First we
- * split the relation into PARALLEL_SEQSCAN_NCHUNKS chunks but we then
- * take the next highest power of 2 number of the chunk size. This means
- * we split the relation into somewhere between PARALLEL_SEQSCAN_NCHUNKS
- * and PARALLEL_SEQSCAN_NCHUNKS / 2 chunks.
- */
- pbscanwork->phsw_chunk_size = pg_nextpower2_32(Max(pbscan->phs_nblocks /
- PARALLEL_SEQSCAN_NCHUNKS, 1));
-
- /*
- * Ensure we don't go over the maximum chunk size with larger tables. This
- * means we may get much more than PARALLEL_SEQSCAN_NCHUNKS for larger
- * tables. Too large a chunk size has been shown to be detrimental to
- * synchronous scan performance.
- */
- pbscanwork->phsw_chunk_size = Min(pbscanwork->phsw_chunk_size,
- PARALLEL_SEQSCAN_MAX_CHUNK_SIZE);
-
retry:
/* Grab the spinlock. */
SpinLockAcquire(&pbscan->phs_mutex);
/*
- * If the scan's startblock has not yet been initialized, we must do so
- * now. If this is not a synchronized scan, we just start at block 0, but
- * if it is a synchronized scan, we must get the starting position from
- * the synchronized scan machinery. We can't hold the spinlock while
- * doing that, though, so release the spinlock, get the information we
- * need, and retry. If nobody else has initialized the scan in the
- * meantime, we'll fill in the value we fetched on the second time
- * through.
+ * When the caller specified a limit on the number of blocks to scan, set
+ * that in the ParallelBlockTableScanDesc, if it's not been done by
+ * another worker already.
+ */
+ if (numblocks != InvalidBlockNumber &&
+ pbscan->phs_numblock == InvalidBlockNumber)
+ {
+ pbscan->phs_numblock = numblocks;
+ }
+
+ /*
+ * If the scan's phs_startblock has not yet been initialized, we must do
+ * so now. If a startblock was specified, start there, otherwise if this
+ * is not a synchronized scan, we just start at block 0, but if it is a
+ * synchronized scan, we must get the starting position from the
+ * synchronized scan machinery. We can't hold the spinlock while doing
+ * that, though, so release the spinlock, get the information we need, and
+ * retry. If nobody else has initialized the scan in the meantime, we'll
+ * fill in the value we fetched on the second time through.
*/
if (pbscan->phs_startblock == InvalidBlockNumber)
{
- if (!pbscan->base.phs_syncscan)
+ if (startblock != InvalidBlockNumber)
+ pbscan->phs_startblock = startblock;
+ else if (!pbscan->base.phs_syncscan)
pbscan->phs_startblock = 0;
else if (sync_startpage != InvalidBlockNumber)
pbscan->phs_startblock = sync_startpage;
@@ -515,6 +512,34 @@ retry:
}
}
SpinLockRelease(&pbscan->phs_mutex);
+
+ /*
+ * Figure out how many blocks we're going to scan; either all of them, or
+ * just phs_numblock's worth, if a limit has been imposed.
+ */
+ if (pbscan->phs_numblock == InvalidBlockNumber)
+ scan_nblocks = pbscan->phs_nblocks;
+ else
+ scan_nblocks = pbscan->phs_numblock;
+
+ /*
+ * We determine the chunk size based on scan_nblocks. First we split
+ * scan_nblocks into PARALLEL_SEQSCAN_NCHUNKS chunks then we calculate the
+ * next highest power of 2 number of the result. This means we split the
+ * blocks we're scanning into somewhere between PARALLEL_SEQSCAN_NCHUNKS
+ * and PARALLEL_SEQSCAN_NCHUNKS / 2 chunks.
+ */
+ pbscanwork->phsw_chunk_size = pg_nextpower2_32(Max(scan_nblocks /
+ PARALLEL_SEQSCAN_NCHUNKS, 1));
+
+ /*
+ * Ensure we don't go over the maximum chunk size with larger tables. This
+ * means we may get much more than PARALLEL_SEQSCAN_NCHUNKS for larger
+ * tables. Too large a chunk size has been shown to be detrimental to
+ * synchronous scan performance.
+ */
+ pbscanwork->phsw_chunk_size = Min(pbscanwork->phsw_chunk_size,
+ PARALLEL_SEQSCAN_MAX_CHUNK_SIZE);
}
/*
@@ -530,6 +555,7 @@ table_block_parallelscan_nextpage(Relation rel,
ParallelBlockTableScanWorker pbscanwork,
ParallelBlockTableScanDesc pbscan)
{
+ BlockNumber scan_nblocks;
BlockNumber page;
uint64 nallocated;
@@ -550,7 +576,7 @@ table_block_parallelscan_nextpage(Relation rel,
*
* Here we name these ranges of blocks "chunks". The initial size of
* these chunks is determined in table_block_parallelscan_startblock_init
- * based on the size of the relation. Towards the end of the scan, we
+ * based on the number of blocks to scan. Towards the end of the scan, we
* start making reductions in the size of the chunks in order to attempt
* to divide the remaining work over all the workers as evenly as
* possible.
@@ -567,17 +593,23 @@ table_block_parallelscan_nextpage(Relation rel,
* phs_nallocated counter will exceed rs_nblocks, because workers will
* still increment the value, when they try to allocate the next block but
* all blocks have been allocated already. The counter must be 64 bits
- * wide because of that, to avoid wrapping around when rs_nblocks is close
- * to 2^32.
+ * wide because of that, to avoid wrapping around when scan_nblocks is
+ * close to 2^32.
*
* The actual block to return is calculated by adding the counter to the
- * starting block number, modulo nblocks.
+ * starting block number, modulo phs_nblocks.
*/
+ /* First, figure out how many blocks we're planning on scanning */
+ if (pbscan->phs_numblock == InvalidBlockNumber)
+ scan_nblocks = pbscan->phs_nblocks;
+ else
+ scan_nblocks = pbscan->phs_numblock;
+
/*
- * First check if we have any remaining blocks in a previous chunk for
- * this worker. We must consume all of the blocks from that before we
- * allocate a new chunk to the worker.
+ * Now check if we have any remaining blocks in a previous chunk for this
+ * worker. We must consume all of the blocks from that before we allocate
+ * a new chunk to the worker.
*/
if (pbscanwork->phsw_chunk_remaining > 0)
{
@@ -599,7 +631,7 @@ table_block_parallelscan_nextpage(Relation rel,
* chunk size set to 1.
*/
if (pbscanwork->phsw_chunk_size > 1 &&
- pbscanwork->phsw_nallocated > pbscan->phs_nblocks -
+ pbscanwork->phsw_nallocated > scan_nblocks -
(pbscanwork->phsw_chunk_size * PARALLEL_SEQSCAN_RAMPDOWN_CHUNKS))
pbscanwork->phsw_chunk_size >>= 1;
@@ -614,15 +646,9 @@ table_block_parallelscan_nextpage(Relation rel,
pbscanwork->phsw_chunk_remaining = pbscanwork->phsw_chunk_size - 1;
}
- /*
- * Check if we've allocated every block in the relation, or if we've
- * reached the limit imposed by pbscan->phs_numblock (if set).
- */
- if (nallocated >= pbscan->phs_nblocks)
- page = InvalidBlockNumber; /* all blocks have been allocated */
- else if (pbscan->phs_numblock != InvalidBlockNumber &&
- nallocated >= pbscan->phs_numblock)
- page = InvalidBlockNumber; /* upper scan limit reached */
+ /* Check if we've run out of blocks to scan */
+ if (nallocated >= scan_nblocks)
+ page = InvalidBlockNumber; /* all blocks have been allocated */
else
page = (nallocated + pbscan->phs_startblock) % pbscan->phs_nblocks;
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 7b1eb2e82c7..0125464d942 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -40,8 +40,8 @@
#include "executor/nodeSeqscan.h"
#include "executor/nodeSort.h"
#include "executor/nodeSubplan.h"
-#include "executor/tqueue.h"
#include "executor/nodeTidrangescan.h"
+#include "executor/tqueue.h"
#include "jit/jit.h"
#include "nodes/nodeFuncs.h"
#include "pgstat.h"
@@ -502,7 +502,7 @@ ExecParallelInitializeDSM(PlanState *planstate,
case T_TidRangeScanState:
if (planstate->plan->parallel_aware)
ExecTidRangeScanInitializeDSM((TidRangeScanState *) planstate,
- d->pcxt);
+ d->pcxt);
break;
case T_AppendState:
if (planstate->plan->parallel_aware)
@@ -1008,7 +1008,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
case T_TidRangeScanState:
if (planstate->plan->parallel_aware)
ExecTidRangeScanReInitializeDSM((TidRangeScanState *) planstate,
- pcxt);
+ pcxt);
break;
case T_AppendState:
if (planstate->plan->parallel_aware)
diff --git a/src/backend/executor/nodeTidrangescan.c b/src/backend/executor/nodeTidrangescan.c
index 39088755e90..6fd9f68cddd 100644
--- a/src/backend/executor/nodeTidrangescan.c
+++ b/src/backend/executor/nodeTidrangescan.c
@@ -250,13 +250,9 @@ TidRangeNext(TidRangeScanState *node)
}
else
{
- /* rescan with the updated TID range only in non-parallel mode */
- if (scandesc->rs_parallel == NULL)
- {
- /* rescan with the updated TID range */
- table_rescan_tidrange(scandesc, &node->trss_mintid,
- &node->trss_maxtid);
- }
+ /* rescan with the updated TID range */
+ table_rescan_tidrange(scandesc, &node->trss_mintid,
+ &node->trss_maxtid);
}
node->trss_inScan = true;
@@ -419,6 +415,7 @@ ExecInitTidRangeScan(TidRangeScan *node, EState *estate, int eflags)
*/
return tidrangestate;
}
+
/* ----------------------------------------------------------------
* Parallel Scan Support
* ----------------------------------------------------------------
@@ -446,7 +443,7 @@ ExecTidRangeScanEstimate(TidRangeScanState *node, ParallelContext *pcxt)
/* ----------------------------------------------------------------
* ExecTidRangeScanInitializeDSM
*
- * Set up a parallel TID scan descriptor.
+ * Set up a parallel TID range scan descriptor.
* ----------------------------------------------------------------
*/
void
@@ -460,19 +457,9 @@ ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt)
pscan,
estate->es_snapshot);
shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
-
- /*
- * Initialize parallel scan descriptor with given TID range if it can be
- * evaluated successfully.
- */
- if (TidRangeEval(node))
- node->ss.ss_currentScanDesc =
- table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan,
- &node->trss_mintid, &node->trss_maxtid);
- else
- node->ss.ss_currentScanDesc =
- table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan,
- NULL, NULL);
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation,
+ pscan);
}
/* ----------------------------------------------------------------
@@ -483,21 +470,12 @@ ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt)
*/
void
ExecTidRangeScanReInitializeDSM(TidRangeScanState *node,
- ParallelContext *pcxt)
+ ParallelContext *pcxt)
{
ParallelTableScanDesc pscan;
pscan = node->ss.ss_currentScanDesc->rs_parallel;
table_parallelscan_reinitialize(node->ss.ss_currentRelation, pscan);
-
- /* Set the new TID range if it can be evaluated successfully */
- if (TidRangeEval(node))
- node->ss.ss_currentRelation->rd_tableam->scan_set_tidrange(
- node->ss.ss_currentScanDesc, &node->trss_mintid,
- &node->trss_maxtid);
- else
- node->ss.ss_currentRelation->rd_tableam->scan_set_tidrange(
- node->ss.ss_currentScanDesc, NULL, NULL);
}
/* ----------------------------------------------------------------
@@ -508,18 +486,12 @@ ExecTidRangeScanReInitializeDSM(TidRangeScanState *node,
*/
void
ExecTidRangeScanInitializeWorker(TidRangeScanState *node,
- ParallelWorkerContext *pwcxt)
+ ParallelWorkerContext *pwcxt)
{
ParallelTableScanDesc pscan;
pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
-
- if (TidRangeEval(node))
- node->ss.ss_currentScanDesc =
- table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan,
- &node->trss_mintid, &node->trss_maxtid);
- else
- node->ss.ss_currentScanDesc =
- table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, pscan,
- NULL, NULL);
+ node->ss.ss_currentScanDesc =
+ table_beginscan_parallel_tidrange(node->ss.ss_currentRelation,
+ pscan);
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 01976226d19..5a7283bd2f5 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1371,7 +1371,11 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
/*
* The first page in a range requires a random seek, but each subsequent
- * page is just a normal sequential page read.
+ * page is just a normal sequential page read. NOTE: it's desirable for
+ * TID Range Scans to cost more than the equivalent Sequential Scans,
+ * because Seq Scans have some performance advantages such as scan
+ * synchronization, and we'd prefer one of them to be picked unless a TID
+ * Range Scan really is better.
*/
ntuples = selectivity * baserel->tuples;
nseqpages = pages - 1.0;
diff --git a/src/backend/optimizer/path/tidpath.c b/src/backend/optimizer/path/tidpath.c
index e48c85833e7..3ddbc10bbdf 100644
--- a/src/backend/optimizer/path/tidpath.c
+++ b/src/backend/optimizer/path/tidpath.c
@@ -47,7 +47,6 @@
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/restrictinfo.h"
-#include "optimizer/cost.h"
/*
@@ -491,9 +490,8 @@ ec_member_matches_ctid(PlannerInfo *root, RelOptInfo *rel,
/*
* create_tidscan_paths
- * Create paths corresponding to direct TID scans of the given rel.
- *
- * Candidate paths are added to the rel's pathlist (using add_path).
+ * Create paths corresponding to direct TID scans of the given rel and add
+ * them to the corresponding path list via add_path or add_partial_path.
*/
bool
create_tidscan_paths(PlannerInfo *root, RelOptInfo *rel)
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 3da43557a13..87a8be10461 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -96,7 +96,7 @@ typedef struct ParallelBlockTableScanDescData
BlockNumber phs_nblocks; /* # blocks in relation at start of scan */
slock_t phs_mutex; /* mutual exclusion for setting startblock */
BlockNumber phs_startblock; /* starting block number */
- BlockNumber phs_numblock; /* # blocks to scan, or InvalidBlockNumber if
+ BlockNumber phs_numblock; /* # blocks to scan, or InvalidBlockNumber if
* no limit */
pg_atomic_uint64 phs_nallocated; /* number of blocks allocated to
* workers so far. */
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8e97fc5f0be..2fa790b6bf5 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1131,16 +1131,14 @@ extern TableScanDesc table_beginscan_parallel(Relation relation,
ParallelTableScanDesc pscan);
/*
- * Begin a parallel tidrange scan. `pscan` needs to have been initialized with
- * table_parallelscan_initialize(), for the same relation. The initialization
- * does not need to have happened in this backend.
+ * Begin a parallel tid range scan. `pscan` needs to have been initialized
+ * with table_parallelscan_initialize(), for the same relation. The
+ * initialization does not need to have happened in this backend.
*
* Caller must hold a suitable lock on the relation.
*/
extern TableScanDesc table_beginscan_parallel_tidrange(Relation relation,
- ParallelTableScanDesc pscan,
- ItemPointerData * mintid,
- ItemPointerData * maxtid);
+ ParallelTableScanDesc pscan);
/*
* Restart a parallel scan. Call this in the leader process. Caller is
@@ -2040,7 +2038,9 @@ extern BlockNumber table_block_parallelscan_nextpage(Relation rel,
ParallelBlockTableScanDesc pbscan);
extern void table_block_parallelscan_startblock_init(Relation rel,
ParallelBlockTableScanWorker pbscanwork,
- ParallelBlockTableScanDesc pbscan);
+ ParallelBlockTableScanDesc pbscan,
+ BlockNumber startblock,
+ BlockNumber numblocks);
/* ----------------------------------------------------------------------------
diff --git a/src/test/regress/expected/tidrangescan.out b/src/test/regress/expected/tidrangescan.out
index 3c5fc9e102a..ce75c96e7c8 100644
--- a/src/test/regress/expected/tidrangescan.out
+++ b/src/test/regress/expected/tidrangescan.out
@@ -297,22 +297,23 @@ FETCH LAST c;
COMMIT;
DROP TABLE tidrangescan;
--- tests for parallel tidrangescans
-SET parallel_setup_cost TO 0;
-SET parallel_tuple_cost TO 0;
-SET min_parallel_table_scan_size TO 0;
-SET max_parallel_workers_per_gather TO 4;
+-- Tests for parallel tidrangescans
+BEGIN;
+SET LOCAL parallel_setup_cost TO 0;
+SET LOCAL parallel_tuple_cost TO 0;
+SET LOCAL min_parallel_table_scan_size TO 0;
+SET LOCAL max_parallel_workers_per_gather TO 4;
CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor = 10);
--- insert enough tuples such that each page gets 5 tuples with fillfactor = 10
+-- Insert enough tuples such that each page gets 5 tuples with fillfactor = 10
INSERT INTO parallel_tidrangescan SELECT i, repeat('x', 100) FROM generate_series(1,200) AS s(i);
--- ensure there are 40 pages for parallel test
+-- Ensure there are 40 pages for parallel test
SELECT min(ctid), max(ctid) FROM parallel_tidrangescan;
min | max
-------+--------
(0,1) | (39,5)
(1 row)
--- parallel range scans with upper bound
+-- Parallel range scans with upper bound
EXPLAIN (COSTS OFF)
SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
QUERY PLAN
@@ -331,7 +332,7 @@ SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
150
(1 row)
--- parallel range scans with lower bound
+-- Parallel range scans with lower bound
EXPLAIN (COSTS OFF)
SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
QUERY PLAN
@@ -350,7 +351,7 @@ SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
150
(1 row)
--- parallel range scans with both bounds
+-- Parallel range scans with both bounds
EXPLAIN (COSTS OFF)
SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
QUERY PLAN
@@ -369,7 +370,7 @@ SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30
100
(1 row)
--- parallel rescans
+-- Parallel rescans
EXPLAIN (COSTS OFF)
SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
@@ -398,9 +399,5 @@ WHERE t.ctid < '(1,0)';
(0,5) | 5
(5 rows)
-DROP TABLE parallel_tidrangescan;
-RESET parallel_setup_cost;
-RESET parallel_tuple_cost;
-RESET min_parallel_table_scan_size;
-RESET max_parallel_workers_per_gather;
+ROLLBACK;
RESET enable_seqscan;
diff --git a/src/test/regress/sql/tidrangescan.sql b/src/test/regress/sql/tidrangescan.sql
index 0f1e43c6d05..c9a63b10ddd 100644
--- a/src/test/regress/sql/tidrangescan.sql
+++ b/src/test/regress/sql/tidrangescan.sql
@@ -98,36 +98,38 @@ COMMIT;
DROP TABLE tidrangescan;
--- tests for parallel tidrangescans
-SET parallel_setup_cost TO 0;
-SET parallel_tuple_cost TO 0;
-SET min_parallel_table_scan_size TO 0;
-SET max_parallel_workers_per_gather TO 4;
+-- Tests for parallel tidrangescans
+BEGIN;
+
+SET LOCAL parallel_setup_cost TO 0;
+SET LOCAL parallel_tuple_cost TO 0;
+SET LOCAL min_parallel_table_scan_size TO 0;
+SET LOCAL max_parallel_workers_per_gather TO 4;
CREATE TABLE parallel_tidrangescan(id integer, data text) WITH (fillfactor = 10);
--- insert enough tuples such that each page gets 5 tuples with fillfactor = 10
+-- Insert enough tuples such that each page gets 5 tuples with fillfactor = 10
INSERT INTO parallel_tidrangescan SELECT i, repeat('x', 100) FROM generate_series(1,200) AS s(i);
--- ensure there are 40 pages for parallel test
+-- Ensure there are 40 pages for parallel test
SELECT min(ctid), max(ctid) FROM parallel_tidrangescan;
--- parallel range scans with upper bound
+-- Parallel range scans with upper bound
EXPLAIN (COSTS OFF)
SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
SELECT count(*) FROM parallel_tidrangescan WHERE ctid < '(30,1)';
--- parallel range scans with lower bound
+-- Parallel range scans with lower bound
EXPLAIN (COSTS OFF)
SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)';
--- parallel range scans with both bounds
+-- Parallel range scans with both bounds
EXPLAIN (COSTS OFF)
SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
SELECT count(*) FROM parallel_tidrangescan WHERE ctid > '(10,0)' AND ctid < '(30,1)';
--- parallel rescans
+-- Parallel rescans
EXPLAIN (COSTS OFF)
SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
@@ -137,10 +139,5 @@ SELECT t.ctid,t2.c FROM parallel_tidrangescan t,
LATERAL (SELECT count(*) c FROM parallel_tidrangescan t2 WHERE t2.ctid <= t.ctid) t2
WHERE t.ctid < '(1,0)';
-DROP TABLE parallel_tidrangescan;
-
-RESET parallel_setup_cost;
-RESET parallel_tuple_cost;
-RESET min_parallel_table_scan_size;
-RESET max_parallel_workers_per_gather;
+ROLLBACK;
RESET enable_seqscan;
--
2.43.0
On Tue, 18 Nov 2025 at 14:51, David Rowley <dgrowleyml@gmail.com> wrote:
I've attached v12, which adds a mention in the docs about Parallel TID
Range scans being supported. It also does very minor adjustments to
the comments. Again, I've kept Cary's v10 and the changes I've made
separate. Of course, I'd squash these before commit.
I went over this again today and only made a few whitespace
adjustments in the tests. I've now pushed the resulting patch.
David
On Thu, 27 Nov 2025 at 14:07, David Rowley <dgrowleyml@gmail.com> wrote:
I went over this again today and only made a few whitespace
adjustments in the tests. I've now pushed the resulting patch.
This seems to have caused issues on skink [1] under Valgrind. The
problem seems to be that initscan() does not always initialise
rs_startblock. I'm now trying to figure out if there's some reason for
that, or if that's been overlooked at some point.
David
[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-11-27%2002%3A17%3A35
On Thu, 27 Nov 2025 at 18:48, David Rowley <dgrowleyml@gmail.com> wrote:
This seems to have caused issues on skink [1] under Valgrind. The
problem seems to be that initscan() does not always initialise
rs_startblock. I'm now trying to figure out if there's some reason for
that, or if that's been overlooked at some point.
I've written the attached patch to address the uninitialised
rs_startblock field.
The patch basically adds:
if (!keep_startblock)
scan->rs_startblock = InvalidBlockNumber;
to initscan() when in parallel mode. The rest of the patch is a small
refactor to make it clearer which parts are for parallel and which are
for serial. I also added a comment to mention that the syncscan start
location is figured out in table_block_parallelscan_startblock_init()
for parallel scans.
David
Attachments:
v1-0001-Fix-possibly-uninitialized-HeapScanDesc.rs_startb.patch (application/octet-stream)
From 1611a32fdda6b8d137b133e04ee83a73d8e08b7e Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Fri, 28 Nov 2025 01:47:52 +1300
Subject: [PATCH v1] Fix possibly uninitialized HeapScanDesc.rs_startblock
The solution used in 0ca3b1697 to determine the Parallel TID Range
Scan's start location was to modify the signature of
table_block_parallelscan_startblock_init() to allow the startblock
to be passed in as a parameter. This allows the scan limits to be
adjusted before that function is called so that the limits are picked up
when the parallel scan starts. The commit made it so the call to
table_block_parallelscan_startblock_init uses the HeapScanDesc's
rs_startblock to pass the startblock to the parallel scan. That all
works ok for Parallel TID Range scans as the HeapScanDesc rs_startblock
gets set by heap_setscanlimits(), but for Parallel Seq Scans, initscan()
has no code path where that's set and that results in passing an
uninitialized value to table_block_parallelscan_startblock_init() as
noted by the buildfarm member skink, running Valgrind.
To fix this issue, make it so initscan() sets the rs_startblock for
parallel scans unless we're doing a rescan. This makes it so
table_block_parallelscan_startblock_init() will be called with the
startblock set to InvalidBlockNumber, and that'll allow the syncscan
code to find the correct start location (when enabled). For Parallel
TID Range Scans, this InvalidBlockNumber value will be overwritten in
the call to heap_setscanlimits().
initscan() is a bit light on documentation on what's meant to get
initialized where for parallel scans. From what I can tell, it looks like
it just didn't matter prior to 0ca3b1697 that rs_startblock was left
uninitialized for parallel scans. To address the light documentation,
I've also added some comments to mention that the syncscan location for
parallel scans is figured out in table_block_parallelscan_startblock_init.
I've also taken the liberty to adjust the if/else if/else code in
initscan() to make it clearer which parts apply to parallel scans and
which parts are for the serial scans.
Author: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAApHDvqALm+k7FyfdQdCw1yF_8HojvR61YRrNhwRQPE=zSmnQA@mail.gmail.com
---
src/backend/access/heap/heapam.c | 47 ++++++++++++++++++++------------
1 file changed, 30 insertions(+), 17 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0a820bab87a..4d382a04338 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -415,28 +415,41 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
scan->rs_base.rs_flags |= SO_ALLOW_SYNC;
else
scan->rs_base.rs_flags &= ~SO_ALLOW_SYNC;
- }
- else if (keep_startblock)
- {
+
/*
- * When rescanning, we want to keep the previous startblock setting,
- * so that rewinding a cursor doesn't generate surprising results.
- * Reset the active syncscan setting, though.
+ * If not rescanning, initialize the startblock. Finding the actual
+ * start location is done in table_block_parallelscan_startblock_init,
+ * based on whether an alternative start location has been set with
+ * heap_setscanlimits, or using the syncscan location, when syncscan
+ * is enabled.
*/
- if (allow_sync && synchronize_seqscans)
- scan->rs_base.rs_flags |= SO_ALLOW_SYNC;
- else
- scan->rs_base.rs_flags &= ~SO_ALLOW_SYNC;
- }
- else if (allow_sync && synchronize_seqscans)
- {
- scan->rs_base.rs_flags |= SO_ALLOW_SYNC;
- scan->rs_startblock = ss_get_location(scan->rs_base.rs_rd, scan->rs_nblocks);
+ if (!keep_startblock)
+ scan->rs_startblock = InvalidBlockNumber;
}
else
{
- scan->rs_base.rs_flags &= ~SO_ALLOW_SYNC;
- scan->rs_startblock = 0;
+ if (keep_startblock)
+ {
+ /*
+ * When rescanning, we want to keep the previous startblock
+ * setting, so that rewinding a cursor doesn't generate surprising
+ * results. Reset the active syncscan setting, though.
+ */
+ if (allow_sync && synchronize_seqscans)
+ scan->rs_base.rs_flags |= SO_ALLOW_SYNC;
+ else
+ scan->rs_base.rs_flags &= ~SO_ALLOW_SYNC;
+ }
+ else if (allow_sync && synchronize_seqscans)
+ {
+ scan->rs_base.rs_flags |= SO_ALLOW_SYNC;
+ scan->rs_startblock = ss_get_location(scan->rs_base.rs_rd, scan->rs_nblocks);
+ }
+ else
+ {
+ scan->rs_base.rs_flags &= ~SO_ALLOW_SYNC;
+ scan->rs_startblock = 0;
+ }
}
scan->rs_numblocks = InvalidBlockNumber;
--
2.43.0
On Fri, 28 Nov 2025 at 02:31, David Rowley <dgrowleyml@gmail.com> wrote:
The patch basically adds:
    if (!keep_startblock)
        scan->rs_startblock = InvalidBlockNumber;
I've pushed that patch.
David
Hi David,
Thanks a lot for the detailed review, the fixes, and for pushing this
forward. I really appreciate the time you spent going through the
patch set.
> 1. Moved block limiting logic for parallel scans into
> table_block_parallelscan_startblock_init(). There's currently a lock
> here to ensure only 1 worker can set the shared memory fields at a
> time. I've hooked into the same lock to set the startblock and
> numblocks.
I agree this is a much better location than heap_setscanlimits() to
ensure only the leader can set parallel scan limits for all workers.
You are right, the condition I put in heap_setscanlimits() cannot
guarantee that the leader would get there first. Thanks for pointing
this out.
> 1. table_parallelscan_initialize() is called first in a parallel TID
> Range Scan which calls table_block_parallelscan_initialize() and may
> set phs_syncscan to true. We directly then call
> table_beginscan_parallel_tidrange(), which sets phs_syncscan = false
> unconditionally. No bugs, but it is a little strange. One way to get
> around this weirdness would be to move the responsibility of setting
> phs_syncscan into table_parallelscan_initialize() and then use
> table_beginscan_parallel_tidrange() to set phs_syncscan = false. I
> wasn't overly concerned about this, so I didn't do that. I just wanted
> to mention it here as someone else might think it's worth making
> better.
Nice catch. Yes, it is a bit weird but harmless. I agree with leaving
it as is for now.
> within each worker process.
>     </para>
>    </listitem>
> +  <listitem>
> +   <para>
> +    In a <emphasis>parallel tid range scan</emphasis>, the range of blocks
> +    will be subdivided into smaller ranges which are shared among the
> +    cooperating processes. Each worker process will complete the scanning
> +    of its given range of blocks before requesting an additional range of
> +    blocks.
> +   </para>
> +  </listitem>
>  </itemizedlist>
I may be missing some info or may be wrong, but my impression is that
the range of blocks is actually set by the leader worker and is
the same among all the cooperating workers rather than
subdivided. The workers fetch as many blocks to process as
they can (similar to a parallel sequential scan) as long as each
block falls within the TID range. The current block number is
stored in the parallel scan descriptor in shared memory, so workers
will not fetch the same block during the scan.
Thanks again for all the help and improvements!
Cary Huang
www.highgo.ca
On Sat, 6 Dec 2025 at 09:12, Cary Huang <cary.huang@highgo.ca> wrote:
> > within each worker process.
> >     </para>
> >    </listitem>
> > +  <listitem>
> > +   <para>
> > +    In a <emphasis>parallel tid range scan</emphasis>, the range of blocks
> > +    will be subdivided into smaller ranges which are shared among the
> > +    cooperating processes. Each worker process will complete the scanning
> > +    of its given range of blocks before requesting an additional range of
> > +    blocks.
> > +   </para>
> > +  </listitem>
> >  </itemizedlist>
>
> I may be missing some info or may be wrong, but my impression is that
> the range of blocks is actually set by the leader worker and is
> the same among all the cooperating workers rather than
> subdivided. The workers fetch as many blocks to process as
> they can (similar to a parallel sequential scan) as long as each
> block falls within the TID range. The current block number is
> stored in the parallel scan descriptor in shared memory, so workers
> will not fetch the same block during the scan.
If you look at what's written for Seq Scans, namely "In a parallel
sequential scan, the table's blocks will be divided into ranges and
shared among the cooperating processes.", this is talking about how
the blocks are shared between workers, i.e. no two workers operate on
the same block. If that happened we'd get wrong results. The "range"
that this is talking about was introduced in 56788d215 to fix the
kernel readahead detection issue with Parallel Seq Scans (the kernel
was not detecting sequential file access due to multiple processes
cooperating on the sequential access). With Parallel TID Range Scans,
we already have the "TID Range" of blocks to scan, so I had to come up
with wording that didn't say "we divide the range into ranges and
distribute ...", so I used "subdivided". I'm happy if someone comes
up with better wording, but I don't see anything factually wrong with
what's there.
The TID Range of blocks is set by whichever worker process gets there
first. That might not be the leader.
David