Support tid range scan in parallel?

Started by Cary Huangabout 2 years ago33 messageshackers

cary.huang@highgo.ca

about 2 years ago

Hello

When using ctid as a
restriction clause with lower and upper bounds, PostgreSQL's planner will use
TID range scan plan to handle such query. This works and generally fine.
However, if the ctid range covers a huge amount of data, the planner will not
use parallel worker to perform ctid range scan because it is not supported. It could, however,
still choose to use parallel sequential scan to complete the scan if ti costs less.

In one of our
migration scenarios, we rely on tid range scan to migrate huge table from one
database to another once the lower and upper ctid bound is determined. With the
support of parallel ctid range scan, this process could be done much quicker.

The attached patch
is my approach to add parallel ctid range scan to PostgreSQL's planner and executor. In my
tests, I do see an increase in performance using parallel tid range scan over
the single worker tid range scan and it is also faster than parallel sequential
scan covering similar ranges. Of course, the table needs to be large enough to
reflect the performance increase.

below is the timing to complete a select query covering all the records in a simple 2-column
table with 40 million records,

- tid range scan takes 10216ms

- tid range scan with 2 workers takes 7109ms

- sequential scan with 2 workers takes 8499ms

Having the support
for parallel ctid range scan is definitely helpful in our migration case, I am
sure it could be useful in other cases as well. I am sharing the patch here and
if someone could provide a quick feedback or review that would be greatly appreciated.

Thank you!

Cary Huang

-------------

HighGo Software Inc. (Canada)

mailto:cary.huang@highgo.ca

http://www.highgo.ca

Import Notes

Reply to msg id not found:

dgrowleyml@gmail.com

about 2 years ago

In reply to: Cary Huang (#1)

Re: Support tid range scan in parallel?

On Tue, 30 Apr 2024 at 10:36, Cary Huang <cary.huang@highgo.ca> wrote:

In one of our migration scenarios, we rely on tid range scan to migrate huge table from one database to another once the lower and upper ctid bound is determined. With the support of parallel ctid range scan, this process could be done much quicker.

I would have thought that the best way to migrate would be to further
divide the TID range into N segments and run N queries, one per
segment to get the data out.

From a CPU point of view, I'd hard to imagine that a SELECT * query
without any other items in the WHERE clause other than the TID range
quals would run faster with multiple workers than with 1. The problem
is the overhead of pushing tuples to the main process often outweighs
the benefits of the parallelism. However, from an I/O point of view
on a server with slow enough disks, I can imagine there'd be a
speedup.

The attached patch is my approach to add parallel ctid range scan to PostgreSQL's planner and executor. In my tests, I do see an increase in performance using parallel tid range scan over the single worker tid range scan and it is also faster than parallel sequential scan covering similar ranges. Of course, the table needs to be large enough to reflect the performance increase.

below is the timing to complete a select query covering all the records in a simple 2-column table with 40 million records,

- tid range scan takes 10216ms
- tid range scan with 2 workers takes 7109ms
- sequential scan with 2 workers takes 8499ms

Can you share more details about this test? i.e. the query, what the
times are that you've measured (EXPLAIN ANALYZE, or SELECT, COPY?).
Also, which version/commit did you patch against? I was wondering if
the read stream code added in v17 would result in the serial case
running faster because the parallelism just resulted in more I/O
concurrency.

Of course, it may be beneficial to have parallel TID Range for other
cases when more row filtering or aggregation is being done as that
requires pushing fewer tuples over from the parallel worker to the
main process. It just would be good to get to the bottom of if there's
still any advantage to parallelism when no filtering other than the
ctid quals is being done now that we've less chance of having to wait
for I/O coming from disk with the read streams code.

David

cary.huang@highgo.ca

about 2 years ago

In reply to: David Rowley (#2)

Re: Support tid range scan in parallel?

Hi David

Thank you for your reply.

From a CPU point of view, I'd hard to imagine that a SELECT * query
without any other items in the WHERE clause other than the TID range
quals would run faster with multiple workers than with 1. The problem
is the overhead of pushing tuples to the main process often outweighs
the benefits of the parallelism. However, from an I/O point of view
on a server with slow enough disks, I can imagine there'd be a
speedup.

yeah, this is generally true. With everything set to default, the planner would not choose parallel sequential scan if the scan range covers mostly all tuples of a table (to reduce the overhead of pushing tuples to main proc as you mentioned). It is preferred when the target data is small but the table is huge. In my case, it is also the same, the planner by default uses normal tid range scan, so I had to alter cost parameters to influence the planner's decision. This is where I found that with WHERE clause only containing TID ranges that cover the entire table would result faster with parallel workers, at least in my environment.

Of course, it may be beneficial to have parallel TID Range for other
cases when more row filtering or aggregation is being done as that
requires pushing fewer tuples over from the parallel worker to the
main process. It just would be good to get to the bottom of if there's
still any advantage to parallelism when no filtering other than the
ctid quals is being done now that we've less chance of having to wait
for I/O coming from disk with the read streams code.

I believe so too. I shared my test procedure below with ctid being the only quals.

below is the timing to complete a select query covering all the records in a simple 2-column table with 40 million records,

- tid range scan takes 10216ms
- tid range scan with 2 workers takes 7109ms
- sequential scan with 2 workers takes 8499ms

Can you share more details about this test? i.e. the query, what the
times are that you've measured (EXPLAIN ANALYZE, or SELECT, COPY?).
Also, which version/commit did you patch against? I was wondering if
the read stream code added in v17 would result in the serial case
running faster because the parallelism just resulted in more I/O
concurrency.

Yes of course. These numbers were obtained earlier this year on master with the patch applied most likely without the read stream code you mentioned. The patch attached here is rebased to commit dd0183469bb779247c96e86c2272dca7ff4ec9e7 on master, which is quite recent and should have the read stream code for v17 as I can immediately tell that the serial scans run much faster now in my setup. I increased the records on the test table from 40 to 100 million because serial scans are much faster now. Below is the summary and details of my test. Note that I only include the EXPLAIN ANALYZE details of round1 test. Round2 is the same except for different execution times.

[env]
- OS: Ubuntu 18.04
- CPU: 4 cores @ 3.40 GHz
- MEM: 16 GB

[test table setup]
initdb with all default values
CREATE TABLE test (a INT, b TEXT);
INSERT INTO test VALUES(generate_series(1,100000000), 'testing');
SELECT min(ctid), max(ctid) from test;
min | max
-------+--------------
(0,1) | (540540,100)
(1 row)

[summary]
round 1:
tid range scan: 14915ms
tid range scan 2 workers: 12265ms
seq scan with 2 workers: 12675ms

round2:
tid range scan: 12112ms
tid range scan 2 workers: 10649ms
seq scan with 2 workers: 11206ms

[details of EXPLAIN ANALYZE below]

[default tid range scan]
EXPLAIN ANALYZE SELECT a FROM test WHERE ctid >= '(1,0)' AND ctid <= '(540540,100)';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Tid Range Scan on test (cost=0.01..1227029.81 rows=68648581 width=4) (actual time=0.188..12280.791 rows=99999815 loops=1)
TID Cond: ((ctid >= '(1,0)'::tid) AND (ctid <= '(540540,100)'::tid))
Planning Time: 0.817 ms
Execution Time: 14915.035 ms
(4 rows)

[parallel tid range scan with 2 workers]
set parallel_setup_cost=0;
set parallel_tuple_cost=0;
set min_parallel_table_scan_size=0;
set max_parallel_workers_per_gather=2;

EXPLAIN ANALYZE SELECT a FROM test WHERE ctid >= '(1,0)' AND ctid <= '(540540,100)';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=0.01..511262.43 rows=68648581 width=4) (actual time=1.322..9249.197 rows=99999815 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Tid Range Scan on test (cost=0.01..511262.43 rows=28603575 width=4) (actual time=0.332..4906.262 rows=33333272 loops=3)
TID Cond: ((ctid >= '(1,0)'::tid) AND (ctid <= '(540540,100)'::tid))
Planning Time: 0.213 ms
Execution Time: 12265.873 ms
(7 rows)

[parallel seq scan with 2 workers]
set enable_tidscan = 'off';

EXPLAIN ANALYZE SELECT a FROM test WHERE ctid >= '(1,0)' AND ctid <= '(540540,100)';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Gather (cost=0.00..969595.42 rows=68648581 width=4) (actual time=4.489..9713.299 rows=99999815 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on test (cost=0.00..969595.42 rows=28603575 width=4) (actual time=0.995..5541.178 rows=33333272 loops=3)
Filter: ((ctid >= '(1,0)'::tid) AND (ctid <= '(540540,100)'::tid))
Rows Removed by Filter: 62
Planning Time: 0.129 ms
Execution Time: 12675.681 ms
(8 rows)

Best regards

Cary Huang
-------------
HighGo Software Inc. (Canada)
cary.huang@highgo.ca
www.highgo.ca

dgrowleyml@gmail.com

about 2 years ago

In reply to: Cary Huang (#3)

Re: Support tid range scan in parallel?

On Wed, 1 May 2024 at 07:10, Cary Huang <cary.huang@highgo.ca> wrote:

Yes of course. These numbers were obtained earlier this year on master with the patch applied most likely without the read stream code you mentioned. The patch attached here is rebased to commit dd0183469bb779247c96e86c2272dca7ff4ec9e7 on master, which is quite recent and should have the read stream code for v17 as I can immediately tell that the serial scans run much faster now in my setup. I increased the records on the test table from 40 to 100 million because serial scans are much faster now. Below is the summary and details of my test. Note that I only include the EXPLAIN ANALYZE details of round1 test. Round2 is the same except for different execution times.

It would be good to see the EXPLAIN (ANALYZE, BUFFERS) with SET
track_io_timing = 1;

Here's a quick review

1. Does not produce correct results:

-- serial plan
postgres=# select count(*) from t where ctid >= '(0,0)' and ctid < '(10,0)';
count
-------
2260
(1 row)

-- parallel plan
postgres=# set max_parallel_workers_per_gather=2;
SET
postgres=# select count(*) from t where ctid >= '(0,0)' and ctid < '(10,0)';
count
-------
0
(1 row)

I've not really looked into why, but I see you're not calling
heap_setscanlimits() in parallel mode. You need to somehow restrict
the block range of the scan to the range specified in the quals. You
might need to do more work to make the scan limits work with parallel
scans.

If you look at heap_scan_stream_read_next_serial(), it's calling
heapgettup_advance_block(), where there's "if (--scan->rs_numblocks
== 0)". But no such equivalent code in
table_block_parallelscan_nextpage() called by
heap_scan_stream_read_next_parallel(). To make Parallel TID Range
work, you'll need heap_scan_stream_read_next_parallel() to abide by
the scan limits.

2. There's a 4 line comment you've added to cost_tidrangescan() which
is just a copy and paste from cost_seqscan(). If you look at the
seqscan costing, the comment is true in that scenario, but not true in
where you've pasted it. The I/O cost is all tied in to run_cost.

+ /* The CPU cost is divided among all the workers. */
+ run_cost /= parallel_divisor;
+
+ /*
+ * It may be possible to amortize some of the I/O cost, but probably
+ * not very much, because most operating systems already do aggressive
+ * prefetching.  For now, we assume that the disk run cost can't be
+ * amortized at all.
+ */

3. Calling TidRangeQualFromRestrictInfoList() once for the serial path
and again for the partial path isn't great. It would be good to just
call that function once and use the result for both path types.

4. create_tidrangescan_subpaths() seems like a weird name for a
function. That seems to imply that scans have subpaths. Scans are
always leaf paths and have no subpaths.

This isn't a complete review. It's just that this seems enough to keep
you busy for a while. I can look a bit harder when the patch is
working correctly. I think you should have enough feedback to allow
that now.

David

cary.huang@highgo.ca

about 2 years ago

In reply to: David Rowley (#4)

Re: Support tid range scan in parallel?

This isn't a complete review. It's just that this seems enough to keep
you busy for a while. I can look a bit harder when the patch is
working correctly. I think you should have enough feedback to allow
that now.

Thanks for the test, review and feedback. They are greatly appreciated!
I will polish the patch some more following your feedback and share new
results / patch when I have them.

Thanks again!

Cary

cary.huang@highgo.ca

about 2 years ago

In reply to: Cary Huang (#5)

Re: Support tid range scan in parallel?

Hello

-- parallel plan
postgres=# set max_parallel_workers_per_gather=2;
SET
postgres=# select count(*) from t where ctid >= '(0,0)' and ctid < '(10,0)';
count
-------
0
(1 row)

I've not really looked into why, but I see you're not calling
heap_setscanlimits() in parallel mode. You need to somehow restrict
the block range of the scan to the range specified in the quals. You
might need to do more work to make the scan limits work with parallel
scans.

I found that select count(*) using parallel tid rangescan for the very first time,
it would return the correct result, but the same subsequent queries would
result in 0 as you stated. This is due to the "pscan->phs_syncscan" set to true
in ExecTidRangeScanInitializeDSM(), inherited from parallel seq scan case.
With syncscan enabled, the "table_block_parallelscan_nextpage()" would
return the next block since the end of the first tid rangescan instead of the
correct start block that should be scanned. I see that single tid rangescan
does not have SO_ALLOW_SYNC set, so I figure syncscan should also be
disabled in parallel case. With this change, then it would be okay to call
heap_setscanlimits() in parallel case, so I added this call back to
heap_set_tidrange() in both serial and parallel cases.

2. There's a 4 line comment you've added to cost_tidrangescan() which
is just a copy and paste from cost_seqscan(). If you look at the
seqscan costing, the comment is true in that scenario, but not true in
where you've pasted it. The I/O cost is all tied in to run_cost.

thanks for pointing out, I have removed these incorrect comments

3. Calling TidRangeQualFromRestrictInfoList() once for the serial path
and again for the partial path isn't great. It would be good to just
call that function once and use the result for both path types.

good point. I moved the adding of tid range scan partial path inside
create_tidscan_paths() where it makes a TidRangeQualFromRestrictInfoList()
call for serial path, so I can just reuse tidrangequals if it is appropriate to
consider parallel tid rangescan.

4. create_tidrangescan_subpaths() seems like a weird name for a
function. That seems to imply that scans have subpaths. Scans are
always leaf paths and have no subpaths.

I removed this function with weird name; it is not needed because the logic inside
is moved to create_tidscan_paths() where it can reuse tidrangequals.

It would be good to see the EXPLAIN (ANALYZE, BUFFERS) with SET
track_io_timing = 1;

the v2 patch is attached that should address the issues above. Below are the EXPLAIN
outputs with track_io_timing = 1 in my environment. Generally, parallel tid range scan
results in more I/O timings and shorter execution time.

SET track_io_timing = 1;

===serial tid rangescan===

EXPLAIN (ANALYZE, BUFFERS) select a from test where ctid >= '(0,0)' and ctid < '(216216,40)';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Tid Range Scan on test (cost=0.01..490815.59 rows=27459559 width=4) (actual time=0.072..10143.770 rows=39999999 loops=1)
TID Cond: ((ctid >= '(0,0)'::tid) AND (ctid < '(216216,40)'::tid))
Buffers: shared hit=298 read=215919 written=12972
I/O Timings: shared read=440.277 write=58.525
Planning:
Buffers: shared hit=2
Planning Time: 0.289 ms
Execution Time: 12497.081 ms
(8 rows)

set parallel_setup_cost=0;
set parallel_tuple_cost=0;
set min_parallel_table_scan_size=0;
set max_parallel_workers_per_gather=2;

===parallel tid rangescan===

EXPLAIN (ANALYZE, BUFFERS) select a from test where ctid >= '(0,0)' and ctid < '(216216,40)';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=0.01..256758.88 rows=40000130 width=4) (actual time=0.878..7083.705 rows=39999999 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared read=216217
I/O Timings: shared read=1224.153
-> Parallel Tid Range Scan on test (cost=0.01..256758.88 rows=16666721 width=4) (actual time=0.256..3980.770 rows=13333333 loops=3)
TID Cond: ((ctid >= '(0,0)'::tid) AND (ctid < '(216216,40)'::tid))
Buffers: shared read=216217
I/O Timings: shared read=1224.153
Planning Time: 0.258 ms
Execution Time: 9731.800 ms
(11 rows)

===serial tid rangescan with aggregate===

set max_parallel_workers_per_gather=0;

EXPLAIN (ANALYZE, BUFFERS) select count(a) from test where ctid >= '(0,0)' and ctid < '(216216,40)';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=716221.63..716221.64 rows=1 width=8) (actual time=12931.695..12931.696 rows=1 loops=1)
Buffers: shared read=216217
I/O Timings: shared read=599.331
-> Tid Range Scan on test (cost=0.01..616221.31 rows=40000130 width=4) (actual time=0.079..6800.482 rows=39999999 loops=1)
TID Cond: ((ctid >= '(0,0)'::tid) AND (ctid < '(216216,40)'::tid))
Buffers: shared read=216217
I/O Timings: shared read=599.331
Planning:
Buffers: shared hit=1 read=2
I/O Timings: shared read=0.124
Planning Time: 0.917 ms
Execution Time: 12932.348 ms
(12 rows)

===parallel tid rangescan with aggregate===

set max_parallel_workers_per_gather=2;
EXPLAIN (ANALYZE, BUFFERS) select count(a) from test where ctid >= '(0,0)' and ctid < '(216216,40)';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=298425.70..298425.71 rows=1 width=8) (actual time=4842.512..4847.863 rows=1 loops=1)
Buffers: shared read=216217
I/O Timings: shared read=1155.321
-> Gather (cost=298425.68..298425.69 rows=2 width=8) (actual time=4842.020..4847.851 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared read=216217
I/O Timings: shared read=1155.321
-> Partial Aggregate (cost=298425.68..298425.69 rows=1 width=8) (actual time=4824.730..4824.731 rows=1 loops=3)
Buffers: shared read=216217
I/O Timings: shared read=1155.321
-> Parallel Tid Range Scan on test (cost=0.01..256758.88 rows=16666721 width=4) (actual time=0.098..2614.108 rows=13333333 loops=3)
TID Cond: ((ctid >= '(0,0)'::tid) AND (ctid < '(216216,40)'::tid))
Buffers: shared read=216217
I/O Timings: shared read=1155.321
Planning:
Buffers: shared read=3
I/O Timings: shared read=3.323
Planning Time: 4.124 ms
Execution Time: 4847.992 ms
(20 rows)

Cary Huang
-------------
HighGo Software Inc. (Canada)
cary.huang@highgo.ca
www.highgo.ca

dgrowleyml@gmail.com

about 2 years ago

In reply to: Cary Huang (#6)

Re: Support tid range scan in parallel?

On Sat, 4 May 2024 at 06:55, Cary Huang <cary.huang@highgo.ca> wrote:

With syncscan enabled, the "table_block_parallelscan_nextpage()" would
return the next block since the end of the first tid rangescan instead of the
correct start block that should be scanned. I see that single tid rangescan
does not have SO_ALLOW_SYNC set, so I figure syncscan should also be
disabled in parallel case. With this change, then it would be okay to call
heap_setscanlimits() in parallel case, so I added this call back to
heap_set_tidrange() in both serial and parallel cases.

This now calls heap_setscanlimits() for the parallel version, it's
just that table_block_parallelscan_nextpage() does nothing to obey
those limits.

The only reason the code isn't reading the entire table is due to the
optimisation in heap_getnextslot_tidrange() which returns false when
the ctid goes out of range. i.e, this code:

/*
* When scanning forward, the TIDs will be in ascending order.
* Future tuples in this direction will be higher still, so we can
* just return false to indicate there will be no more tuples.
*/
if (ScanDirectionIsForward(direction))
return false;

If I comment out that line, I see all pages are accessed:

postgres=# explain (analyze, buffers) select count(*) from a where
ctid >= '(0,1)' and ctid < '(11,0)';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=18.80..18.81 rows=1 width=8) (actual
time=33.530..36.118 rows=1 loops=1)
Buffers: shared read=4425
-> Gather (cost=18.78..18.79 rows=2 width=8) (actual
time=33.456..36.102 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared read=4425
-> Partial Aggregate (cost=18.78..18.79 rows=1 width=8)
(actual time=20.389..20.390 rows=1 loops=3)
Buffers: shared read=4425
-> Parallel Tid Range Scan on a (cost=0.01..16.19
rows=1035 width=0) (actual time=9.375..20.349 rows=829 loops=3)
TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid <
'(11,0)'::tid))
Buffers: shared read=4425 <---- this is all
pages in the table instead of 11 pages.

With that code still commented out, the non-parallel version still
won't read all pages due to the setscanlimits being obeyed.

postgres=# set max_parallel_workers_per_gather=0;
SET
postgres=# explain (analyze, buffers) select count(*) from a where
ctid >= '(0,1)' and ctid < '(11,0)';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Aggregate (cost=45.07..45.08 rows=1 width=8) (actual
time=0.302..0.302 rows=1 loops=1)
Buffers: shared hit=11
-> Tid Range Scan on a (cost=0.01..38.86 rows=2485 width=0)
(actual time=0.019..0.188 rows=2486 loops=1)
TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid < '(11,0)'::tid))
Buffers: shared hit=11

If I put that code back in, how many pages are read depends on the
number of parallel workers as workers will keep running with higher
page numbers and heap_getnextslot_tidrange() will just (inefficiently)
filter those out.

max_parallel_workers_per_gather=2;
-> Parallel Tid Range Scan on a (cost=0.01..16.19
rows=1035 width=0) (actual time=0.191..0.310 rows=829 loops=3)
TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid <
'(11,0)'::tid))
Buffers: shared read=17

max_parallel_workers_per_gather=3;
-> Parallel Tid Range Scan on a (cost=0.01..12.54
rows=802 width=0) (actual time=0.012..0.114 rows=622 loops=4)
TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid <
'(11,0)'::tid))
Buffers: shared hit=19

max_parallel_workers_per_gather=4;
-> Parallel Tid Range Scan on a (cost=0.01..9.72
rows=621 width=0) (actual time=0.014..0.135 rows=497 loops=5)
TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid <
'(11,0)'::tid))
Buffers: shared hit=21

To fix this you need to make table_block_parallelscan_nextpage obey
the limits imposed by heap_setscanlimits().

The equivalent code in the non-parallel version is in
heapgettup_advance_block().

/* check if the limit imposed by heap_setscanlimits() is met */
if (scan->rs_numblocks != InvalidBlockNumber)
{
if (--scan->rs_numblocks == 0)
return InvalidBlockNumber;
}

I've not studied exactly how you'd get the rs_numblock information
down to the parallel scan descriptor. But when you figure that out,
just remember that you can't do the --scan->rs_numblocks from
table_block_parallelscan_nextpage() as that's not parallel safe. You
might be able to add an or condition to: "if (nallocated >=
pbscan->phs_nblocks)" to make it "if (nallocated >=
pbscan->phs_nblocks || nallocated >= pbscan->phs_numblocks)",
although the field names don't seem very intuitive there. It would be
nicer if the HeapScanDesc field was called rs_blocklimit rather than
rs_numblocks. It's not for this patch to go messing with that,
however.

David

cary.huang@highgo.ca

about 2 years ago

In reply to: David Rowley (#7)

Re: Support tid range scan in parallel?

Thank you very much for the test and review. Greatly appreciated!

This now calls heap_setscanlimits() for the parallel version, it's
just that table_block_parallelscan_nextpage() does nothing to obey
those limits.

yes, you are absolutely right. Though heap_setscanlimits() is now called by
parallel tid range scan, table_block_parallelscan_nextpage() does nothing
to obey these limits, resulting in more blocks being inefficiently filtered out
by the optimization code you mentioned in heap_getnextslot_tidrange().

I've not studied exactly how you'd get the rs_numblock information
down to the parallel scan descriptor. But when you figure that out,
just remember that you can't do the --scan->rs_numblocks from
table_block_parallelscan_nextpage() as that's not parallel safe. You
might be able to add an or condition to: "if (nallocated >=
pbscan->phs_nblocks)" to make it "if (nallocated >=
pbscan->phs_nblocks || nallocated >= pbscan->phs_numblocks)",
although the field names don't seem very intuitive there. It would be
nicer if the HeapScanDesc field was called rs_blocklimit rather than
rs_numblocks. It's not for this patch to go messing with that,
however.

rs_numblock was not passed down to the parallel scan context and
table_block_parallelscan_nextpage() did not seem to have a logic to limit
the block scan range set by heap_setscanlimits() in parallel scan. Also, I
noticed that the rs_startblock was also not passed to the parallel scan
context, which causes the parallel scan always start from 0 even when a
lower ctid bound is specified.

so I added a logic in heap_set_tidrange() to pass these 2 values to parallel
scan descriptor as "phs_startblock" and "phs_numblock". These will be
available in table_block_parallelscan_nextpage() in parallel scan.

I followed your recommendation and modified the condition to:

if (nallocated >= pbscan->phs_nblocks || (pbscan->phs_numblock != 0 &&
nallocated >= pbscan->phs_numblock))

so that the parallel tid range scan will stop when the upper scan limit is
reached. With these changes, I see that the number of buffer reads is
consistent between single and parallel ctid range scans. The v3 patch is
attached.

postgres=# explain (analyze, buffers) select count(*) from test where ctid >= '(0,1)' and ctid < '(11,0)';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Aggregate (cost=39.43..39.44 rows=1 width=8) (actual time=1.007..1.008 rows=1 loops=1)
Buffers: shared read=11
-> Tid Range Scan on test (cost=0.01..34.35 rows=2034 width=0) (actual time=0.076..0.639 rows=2035 loops=1)
TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid < '(11,0)'::tid))
Buffers: shared read=11

postgres=# set max_parallel_workers_per_gather=2;
SET
postgres=# explain (analyze, buffers) select count(*) from test where ctid >= '(0,1)' and ctid < '(11,0)';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=16.45..16.46 rows=1 width=8) (actual time=14.329..16.840 rows=1 loops=1)
Buffers: shared hit=11
-> Gather (cost=16.43..16.44 rows=2 width=8) (actual time=3.197..16.814 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=11
-> Partial Aggregate (cost=16.43..16.44 rows=1 width=8) (actual time=0.705..0.706 rows=1 loops=3)
Buffers: shared hit=11
-> Parallel Tid Range Scan on test (cost=0.01..14.31 rows=848 width=0) (actual time=0.022..0.423 rows=678 loops=3)
TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid < '(11,0)'::tid))
Buffers: shared hit=11

postgres=# set max_parallel_workers_per_gather=3;
SET
postgres=# explain (analyze, buffers) select count(*) from test where ctid >= '(0,1)' and ctid < '(11,0)';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=12.74..12.75 rows=1 width=8) (actual time=16.793..19.053 rows=1 loops=1)
Buffers: shared hit=11
-> Gather (cost=12.72..12.73 rows=3 width=8) (actual time=2.827..19.012 rows=4 loops=1)
Workers Planned: 3
Workers Launched: 3
Buffers: shared hit=11
-> Partial Aggregate (cost=12.72..12.73 rows=1 width=8) (actual time=0.563..0.565 rows=1 loops=4)
Buffers: shared hit=11
-> Parallel Tid Range Scan on test (cost=0.01..11.08 rows=656 width=0) (actual time=0.018..0.338 rows=509 loops=4)
TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid < '(11,0)'::tid))
Buffers: shared hit=11

thank you!

Cary Huang
-------------
HighGo Software Inc. (Canada)
cary.huang@highgo.ca
www.highgo.ca

dgrowleyml@gmail.com

about 2 years ago

In reply to: Cary Huang (#8)

Re: Support tid range scan in parallel?

On Thu, 9 May 2024 at 10:23, Cary Huang <cary.huang@highgo.ca> wrote:

The v3 patch is attached.

I've not looked at the patch, but please add it to the July CF. I'll
try and look in more detail then.

David

cary.huang@highgo.ca

about 2 years ago

In reply to: David Rowley (#9)

Re: Support tid range scan in parallel?

I've not looked at the patch, but please add it to the July CF. I'll
try and look in more detail then.

Thanks David, I have added this patch on July commitfest under the
server feature category.

I understand that the regression tests for parallel ctid range scan is a
bit lacking now. It only has a few EXPLAIN clauses to ensure parallel
workers are used when tid ranges are specified. They are added as
part of select_parallel.sql test. I am not sure if it is more appropriate
to have them as part of tidrangescan.sql test instead. So basically
re-run the same test cases in tidrangescan.sql but in parallel?

thank you

Cary

dgrowleyml@gmail.com

about 2 years ago

In reply to: Cary Huang (#10)

Re: Support tid range scan in parallel?

On Fri, 10 May 2024 at 05:16, Cary Huang <cary.huang@highgo.ca> wrote:

I understand that the regression tests for parallel ctid range scan is a
bit lacking now. It only has a few EXPLAIN clauses to ensure parallel
workers are used when tid ranges are specified. They are added as
part of select_parallel.sql test. I am not sure if it is more appropriate
to have them as part of tidrangescan.sql test instead. So basically
re-run the same test cases in tidrangescan.sql but in parallel?

I think tidrangescan.sql is a more suitable location than
select_parallel.sql I don't think you need to repeat all the tests as
many of them are testing the tid qual processing which is the same
code as it is in the parallel version.

You should add a test that creates a table with a very low fillfactor,
low enough so only 1 tuple can fit on each page and insert a few dozen
tuples. The test would do SELECT COUNT(*) to ensure you find the
correct subset of tuples. You'd maybe want MIN(ctid) and MAX(ctid) in
there too for extra coverage to ensure that the correct tuples are
being counted. Just make sure and EXPLAIN (COSTS OFF) the query first
in the test to ensure that it's always testing the plan you're
expecting to test.

David

cary.huang@highgo.ca

about 2 years ago

In reply to: David Rowley (#11)

Re: Support tid range scan in parallel?

You should add a test that creates a table with a very low fillfactor,
low enough so only 1 tuple can fit on each page and insert a few dozen
tuples. The test would do SELECT COUNT(*) to ensure you find the
correct subset of tuples. You'd maybe want MIN(ctid) and MAX(ctid) in
there too for extra coverage to ensure that the correct tuples are
being counted. Just make sure and EXPLAIN (COSTS OFF) the query first
in the test to ensure that it's always testing the plan you're
expecting to test.

thank you for the test suggestion. I moved the regress tests from select_parallel
to tidrangescan instead. I follow the existing test table creation in tidrangescan
with the lowest fillfactor of 10, I am able to get consistent 5 tuples per page
instead of 1. It should be okay as long as it is consistently 5 tuples per page so
the tuple count results from parallel tests would be in multiples of 5.

The attached v4 patch includes the improved regression tests.

Thank you very much!

Cary Huang
-------------
HighGo Software Inc. (Canada)
cary.huang@highgo.ca
www.highgo.ca

zhjwpku@gmail.com

almost 2 years ago

In reply to: Cary Huang (#12)

Re: Support tid range scan in parallel?

Hi Cary,

On Wed, May 15, 2024 at 5:33 AM Cary Huang <cary.huang@highgo.ca> wrote:

You should add a test that creates a table with a very low fillfactor,
low enough so only 1 tuple can fit on each page and insert a few dozen
tuples. The test would do SELECT COUNT(*) to ensure you find the
correct subset of tuples. You'd maybe want MIN(ctid) and MAX(ctid) in
there too for extra coverage to ensure that the correct tuples are
being counted. Just make sure and EXPLAIN (COSTS OFF) the query first
in the test to ensure that it's always testing the plan you're
expecting to test.

thank you for the test suggestion. I moved the regress tests from select_parallel
to tidrangescan instead. I follow the existing test table creation in tidrangescan
with the lowest fillfactor of 10, I am able to get consistent 5 tuples per page
instead of 1. It should be okay as long as it is consistently 5 tuples per page so
the tuple count results from parallel tests would be in multiples of 5.

The attached v4 patch includes the improved regression tests.

Thank you very much!

Cary Huang
-------------
HighGo Software Inc. (Canada)
cary.huang@highgo.ca
www.highgo.ca

+++ b/src/backend/access/heap/heapam.c
@@ -307,6 +307,7 @@ initscan(HeapScanDesc scan, ScanKey key, bool
keep_startblock)
* results for a non-MVCC snapshot, the caller must hold some higher-level
* lock that ensures the interesting tuple(s) won't change.)
*/
+

I see no reason why you add a blank line here, is it a typo?

+/* ----------------------------------------------------------------
+ * ExecSeqScanInitializeWorker
+ *
+ * Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeWorker(TidRangeScanState *node,
+ ParallelWorkerContext *pwcxt)
+{
+ ParallelTableScanDesc pscan;

Function name in the comment is not consistent.

@@ -81,6 +81,7 @@ typedef struct ParallelBlockTableScanDescData
BlockNumber phs_startblock; /* starting block number */
pg_atomic_uint64 phs_nallocated; /* number of blocks allocated to
* workers so far. */
+ BlockNumber phs_numblock; /* max number of blocks to scan */
} ParallelBlockTableScanDescData;

Can this be reorganized by putting phs_numblock after phs_startblock?

--
Regards
Junwang Zhao

rafia.pghackers@gmail.com

almost 2 years ago

In reply to: Junwang Zhao (#13)

Re: Support tid range scan in parallel?

This is a good idea to extend parallelism in postgres.
I went through this patch, and here are a few review comments,

+ Size pscan_len; /* size of parallel tid range scan descriptor */

The other name for this var could be tidrs_PscanLen, following the pattern
in indexScanState and IndexOnlyScanState.
Also add it and its description in the comment above the struct.

/* ----------------------------------------------------------------
* ExecTidRangeScanInitializeDSM
*
* Set up a parallel heap scan descriptor.
* ----------------------------------------------------------------
*/
This comment doesn't seem right, please correct it to say for Tid range
scan descriptor.

+
+ sscan = relation->rd_tableam->scan_begin(relation, snapshot, 0, NULL,
+ pscan, flags);
+
+ return sscan;

I do not see any requirement of using this sscan var.

- if (nallocated >= pbscan->phs_nblocks)
+ if (nallocated >= pbscan->phs_nblocks || (pbscan->phs_numblock != 0 &&
+ nallocated >= pbscan->phs_numblock))
  page = InvalidBlockNumber; /* all blocks have been allocated */

Please add a comment for the reason for this change. As far as I
understood, this is only for the case of TIDRangeScan so it requires an
explanation for the case.

On Sun, 11 Aug 2024 at 09:03, Junwang Zhao <zhjwpku@gmail.com> wrote:

Hi Cary,

On Wed, May 15, 2024 at 5:33 AM Cary Huang <cary.huang@highgo.ca> wrote:

You should add a test that creates a table with a very low fillfactor,
low enough so only 1 tuple can fit on each page and insert a few dozen
tuples. The test would do SELECT COUNT(*) to ensure you find the
correct subset of tuples. You'd maybe want MIN(ctid) and MAX(ctid) in
there too for extra coverage to ensure that the correct tuples are
being counted. Just make sure and EXPLAIN (COSTS OFF) the query first
in the test to ensure that it's always testing the plan you're
expecting to test.

thank you for the test suggestion. I moved the regress tests from

select_parallel

to tidrangescan instead. I follow the existing test table creation in

tidrangescan

with the lowest fillfactor of 10, I am able to get consistent 5 tuples

per page

instead of 1. It should be okay as long as it is consistently 5 tuples

per page so

the tuple count results from parallel tests would be in multiples of 5.

The attached v4 patch includes the improved regression tests.

Thank you very much!

Cary Huang
-------------
HighGo Software Inc. (Canada)
cary.huang@highgo.ca
www.highgo.ca
+++ b/src/backend/access/heap/heapam.c
@@ -307,6 +307,7 @@ initscan(HeapScanDesc scan, ScanKey key, bool
keep_startblock)
* results for a non-MVCC snapshot, the caller must hold some higher-level
* lock that ensures the interesting tuple(s) won't change.)
*/
+
I see no reason why you add a blank line here, is it a typo?
+/* ----------------------------------------------------------------
+ * ExecSeqScanInitializeWorker
+ *
+ * Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecTidRangeScanInitializeWorker(TidRangeScanState *node,
+ ParallelWorkerContext *pwcxt)
+{
+ ParallelTableScanDesc pscan;
Function name in the comment is not consistent.
@@ -81,6 +81,7 @@ typedef struct ParallelBlockTableScanDescData
BlockNumber phs_startblock; /* starting block number */
pg_atomic_uint64 phs_nallocated; /* number of blocks allocated to
* workers so far. */
+ BlockNumber phs_numblock; /* max number of blocks to scan */
} ParallelBlockTableScanDescData;
Can this be reorganized by putting phs_numblock after phs_startblock?

--
Regards
Junwang Zhao

--
Regards,
Rafia Sabih

cary.huang@highgo.ca

12 months ago

In reply to: Rafia Sabih (#14)

Re: Support tid range scan in parallel?

Hello

Sorry David and all who have reviewed the patch, it's been awhile since the patch
was last worked on :(. Thank you all for the reviews and comments! Attached is
the rebased patch that adds support for parallel TID range scans. This feature is
particularly useful scanning large tables where the data needs to be scanned in
sizable segments using a TID range in the WHERE clause to define each segment.
By enabling parallelism, this approach can improve performance compared to
both non-parallel TID range scans and traditional sequential scans.

Regards

Cary Huang

zhjwpku@gmail.com

12 months ago

In reply to: Cary Huang (#15)

Re: Support tid range scan in parallel?

Hi Cary,

On Fri, Jun 6, 2025 at 5:24 AM Cary Huang <cary.huang@highgo.ca> wrote:

Hello

Sorry David and all who have reviewed the patch, it's been awhile since the patch
was last worked on :(. Thank you all for the reviews and comments! Attached is
the rebased patch that adds support for parallel TID range scans. This feature is
particularly useful scanning large tables where the data needs to be scanned in
sizable segments using a TID range in the WHERE clause to define each segment.
By enabling parallelism, this approach can improve performance compared to
both non-parallel TID range scans and traditional sequential scans.

Regards

Cary Huang

Thanks for updating the patch. I have a few comments on it.

+ /*
+ * if parallel mode is used, store startblock and numblocks in parallel
+ * scan descriptor as well.
+ */
+ if (scan->rs_base.rs_parallel != NULL)
+ {
+ ParallelBlockTableScanDesc bpscan = NULL;
+
+ bpscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
+ bpscan->phs_startblock = scan->rs_startblock;
+ bpscan->phs_numblock = scan->rs_numblocks;
+ }

It would be more intuitive and clearer to directly use startBlk and numBlks
to set these values. Since scan->rs_startblock and scan->rs_numblocks
are already set using these variables, using the same approach for bpscan
would make the code easier to understand.

Another nitty-gritty is that you might want to use a capital `If` in the
comments to maintain the same style.

+ if (nallocated >= pbscan->phs_nblocks || (pbscan->phs_numblock != 0 &&
+ nallocated >= pbscan->phs_numblock))

I'd suggest explictly setting phs_numblock to InvalidBlockNumber in
table_block_parallelscan_initialize, and compare with InvalidBlockNumber
here.

--
Regards
Junwang Zhao

cary.huang@highgo.ca

12 months ago

In reply to: Junwang Zhao (#16)

Re: Support tid range scan in parallel?

Hi Junwang

Thank you so much for the review!

+ /*
+ * if parallel mode is used, store startblock and numblocks in parallel
+ * scan descriptor as well.
+ */
+ if (scan->rs_base.rs_parallel != NULL)
+ {
+ ParallelBlockTableScanDesc bpscan = NULL;
+
+ bpscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
+ bpscan->phs_startblock = scan->rs_startblock;
+ bpscan->phs_numblock = scan->rs_numblocks;
+ }
It would be more intuitive and clearer to directly use startBlk and numBlks
to set these values. Since scan->rs_startblock and scan->rs_numblocks
are already set using these variables, using the same approach for bpscan
would make the code easier to understand.

Another nitty-gritty is that you might want to use a capital `If` in the
comments to maintain the same style.

Agreed, made the adjustment in the attached patch.

+ if (nallocated >= pbscan->phs_nblocks || (pbscan->phs_numblock != 0 &&
+ nallocated >= pbscan->phs_numblock))
I'd suggest explictly setting phs_numblock to InvalidBlockNumber in
table_block_parallelscan_initialize, and compare with InvalidBlockNumber
here.

Also agreed, phs_numblock should be initialized in
table_block_parallelscan_initialize just like all other parameters in parallel scan
context. You are right, it is much neater to use InvalidBlockNumber rather
than 0 to indicate if an upper bound has been specified in the TID range scan.

I have addressed your comment in the attached v6 patch. Thank you again for
the review.

Best regards
Cary Huang

niushiji@gmail.com

12 months ago

In reply to: Cary Huang (#17)

Re: Support tid range scan in parallel?

Hi, Cary,

I have two comments:
1. Does table_beginscan_parallel_tidrange() need an assert of relid,
like what table_beginscan_parallel() did?
Assert(RelationGetRelid(relation) == pscan->phs_relid);

2. The new field phs_numblock in ParallelBlockTableScanDescData
structure has almost the same name as another field phs_nblocks. Would
you consider changing it to another name, for example,
phs_maxnumblocktoscan?

Thanks,
Steven

在 2025/6/10 7:04, Cary Huang 写道:

Show quoted text

Hi Junwang

Thank you so much for the review!
+ /*
+ * if parallel mode is used, store startblock and numblocks in parallel
+ * scan descriptor as well.
+ */
+ if (scan->rs_base.rs_parallel != NULL)
+ {
+ ParallelBlockTableScanDesc bpscan = NULL;
+
+ bpscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
+ bpscan->phs_startblock = scan->rs_startblock;
+ bpscan->phs_numblock = scan->rs_numblocks;
+ }
It would be more intuitive and clearer to directly use startBlk and numBlks
to set these values. Since scan->rs_startblock and scan->rs_numblocks
are already set using these variables, using the same approach for bpscan
would make the code easier to understand.

Another nitty-gritty is that you might want to use a capital `If` in the
comments to maintain the same style.
Agreed, made the adjustment in the attached patch.
+ if (nallocated >= pbscan->phs_nblocks || (pbscan->phs_numblock != 0 &&
+ nallocated >= pbscan->phs_numblock))
I'd suggest explictly setting phs_numblock to InvalidBlockNumber in
table_block_parallelscan_initialize, and compare with InvalidBlockNumber
here.
Also agreed, phs_numblock should be initialized in
table_block_parallelscan_initialize just like all other parameters in parallel scan
context. You are right, it is much neater to use InvalidBlockNumber rather
than 0 to indicate if an upper bound has been specified in the TID range scan.

I have addressed your comment in the attached v6 patch. Thank you again for
the review.

Best regards
Cary Huang

cary.huang@highgo.ca

11 months ago

In reply to: Steven Niu (#18)

Re: Support tid range scan in parallel?

Hi Steven

thanks for the review!

I have two comments:
1. Does table_beginscan_parallel_tidrange() need an assert of relid,
like what table_beginscan_parallel() did?
Assert(RelationGetRelid(relation) == pscan->phs_relid);

In the v6 rebased patch, the assert has become:

Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));

rather than:

Assert(RelationGetRelid(relation) == pscan->phs_relid);

table_beginscan_parallel_tidrange() already has the proper assert line
similar to what table_beginscan_parallel() has.

2. The new field phs_numblock in ParallelBlockTableScanDescData
structure has almost the same name as another field phs_nblocks. Would
you consider changing it to another name, for example,
phs_maxnumblocktoscan?

I actually had a similar thought too, phs_nblocks and phs_numblock are
very similar but are quite different. But I still left the name as phs_numblock
because I want to keep it consistent (kind of) with the 'numBlks' used in
heap_set_tidrange() in heapam.c. The comments besides their declaration
should be enough to describe their differences without causing confusion.

Best regards
Cary

dgrowleyml@gmail.com

10 months ago

In reply to: Cary Huang (#17)

Re: Support tid range scan in parallel?

On Tue, 10 Jun 2025 at 11:04, Cary Huang <cary.huang@highgo.ca> wrote:

I have addressed your comment in the attached v6 patch. Thank you again for
the review.

Here's a review of v6:

1. In cost_tidrangescan() you're dividing the total costs by the
number of workers yet the comment is claiming that's CPU cost. I think
this needs to follow the lead of cost_seqscan() and separate out the
CPU and IO cost then add the IO cost at the end, after the divide by
the number of workers.

2. In execParallel.c, could you move the case for T_TidRangeScanState
below T_ForeignScanState? What you have right now is now quite
following the unwritten standard set out by the other nodes, i.e
non-parallel aware nodes are last. A good spot seems to be putting it
at the end of the scan types... Custom scan seems slightly misplaced,
but probably can ignore that one and put it after T_ForeignScanState

3. The following comment should mention what behaviour occurs when the
field is set to InvalidBlockNumber:

BlockNumber phs_numblock; /* max number of blocks to scan */

Something like /* # of blocks to scan, or InvalidBlockNumber if no limit */

4. I think the following would be clearer if written using an else if:

if (nallocated >= pbscan->phs_nblocks || (pbscan->phs_numblock !=
InvalidBlockNumber && nallocated >= pbscan->phs_numblock))
page = InvalidBlockNumber; /* all blocks have been allocated */
else
page = (nallocated + pbscan->phs_startblock) % pbscan->phs_nblocks;

e.g:

if (nallocated >= pbscan->phs_nblocks)
page = InvalidBlockNumber; /* all blocks have been allocated */
else if (pbscan->phs_numblock != InvalidBlockNumber &&
nallocated >= pbscan->phs_numblock)
page = InvalidBlockNumber; /* upper scan limit reached */
else
page = (nallocated + pbscan->phs_startblock) % pbscan->phs_nblocks;

That way the comment after the assignment is accurate.

5. For the tests, is there any reason not to reuse the tidrangescan table?

I don't see any other issues, but I've not tested the patch yet. I'll
do that if you can fix the 5 above.

Thanks

David

cary.huang@highgo.ca

10 months ago

In reply to: David Rowley (#20)

dgrowleyml@gmail.com

10 months ago

In reply to: Cary Huang (#21)

cary.huang@highgo.ca

10 months ago

In reply to: David Rowley (#22)

dgrowleyml@gmail.com

10 months ago

In reply to: Cary Huang (#23)

cary.huang@highgo.ca

9 months ago

In reply to: David Rowley (#24)

dgrowleyml@gmail.com

7 months ago

In reply to: Cary Huang (#25)

dgrowleyml@gmail.com

6 months ago

In reply to: David Rowley (#26)

dgrowleyml@gmail.com

6 months ago

In reply to: David Rowley (#27)

dgrowleyml@gmail.com

6 months ago

In reply to: David Rowley (#28)

dgrowleyml@gmail.com

6 months ago

In reply to: David Rowley (#29)

dgrowleyml@gmail.com

6 months ago

In reply to: David Rowley (#30)

cary.huang@highgo.ca

6 months ago

In reply to: David Rowley (#31)

dgrowleyml@gmail.com

6 months ago

In reply to: Cary Huang (#32)