Parallel Seq Scan
As per discussion on another thread about using custom scan nodes
for a prototype of parallel sequential scan, I have developed the
same, but directly by adding new nodes for parallel sequential scan.
There might be some advantages to developing this as a contrib
module using custom scan nodes; however, I think we might get
stuck at some point due to the limits of the custom scan node
capability, as pointed out by Andres.
The basic idea is that while evaluating the cheapest path for a
scan, the optimizer will also evaluate whether it can use a parallel
seq scan path. Currently I have kept a very simple model to
calculate the cost of the parallel seq scan path: divide the CPU
and disk cost by the available number of worker backends. (We can
enhance it based on further experiments and discussion; we need to
consider worker startup and dynamic shared memory setup cost as
well.) The work, i.e. the scan of blocks, is divided equally among
all workers, except for corner cases where the blocks can't be
divided equally, in which case the last worker is responsible for
scanning the remaining blocks.
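To illustrate, here is a standalone sketch of that naive model
(names are illustrative, not the patch's actual code):

/*
 * Naive parallel seq scan costing: divide the total CPU and disk
 * cost of a plain seq scan by the number of worker backends.
 * Worker startup and dynamic shared memory setup costs are
 * deliberately ignored, as noted above.
 */
static double
parallel_seqscan_cost(double pages, double tuples, int num_workers)
{
    double seq_page_cost = 1.0;     /* default value of the GUC */
    double cpu_tuple_cost = 0.01;   /* default value of the GUC */

    return (seq_page_cost * pages + cpu_tuple_cost * tuples) / num_workers;
}

For the 100-page, 100-row table used in the EXPLAIN examples below,
this gives (100 * 1.0 + 100 * 0.01) / 4 = 25.25, matching the total
cost shown in the parallel plan.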
The number of worker backends that can be used for a parallel seq
scan can be configured using a new GUC, parallel_seqscan_degree.
Its default value is zero, which means a parallel seq scan will not
be considered unless the user configures this value.
In the ExecutorStart phase, we initiate the required number of
workers as per the parallel seq scan plan, set up dynamic shared
memory, and share the information required for the workers to
execute the scan. Currently I have just shared the relId,
targetlist and number of blocks to be scanned by each worker;
however, I think we might want to generate a plan for each of the
workers in the master backend and then share it with the individual
workers.
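As a rough sketch, the information placed in the dynamic shared
memory segment could look like this (the struct and field names are
illustrative only, not the patch's actual layout):

typedef unsigned int Oid;           /* stand-in for the PG typedef */
typedef unsigned int BlockNumber;   /* stand-in for the PG typedef */

/* Per-worker scan information shared through dynamic shared memory. */
typedef struct ParallelScanInfo
{
    Oid         relid;          /* relation to scan */
    BlockNumber start_block;    /* first block assigned to this worker */
    BlockNumber num_blocks;     /* how many blocks it should scan */
    char        targetlist[1];  /* flattened target list, variable length */
} ParallelScanInfo;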
Now, to fetch the data from the multiple queues corresponding to
the workers, a simple mechanism is used: fetch from the first queue
until all its data is consumed, then fetch from the second queue,
and so on. Also, here the master backend is responsible only for
getting the data from the workers and passing it back to the
client. I am sure we can improve this strategy in many ways, for
example by making the master backend also perform the scan for some
of the blocks rather than just collecting data from the workers, or
by using a better strategy to fetch the data from the multiple
queues.
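In pseudo-C, the current strategy is roughly the following
(receive_from_queue() and send_to_client() are stand-ins for the
real shared memory queue receive and DestReceiver calls):

/* Drain each worker's queue completely before moving to the next
 * one; a NULL return means the worker has finished its range. */
for (i = 0; i < num_workers; i++)
{
    HeapTuple tup;

    while ((tup = receive_from_queue(worker_queue[i])) != NULL)
        send_to_client(tup);
}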
The worker backend receives the scan-related information from the
master backend, generates a plan from it and executes that plan.
The work to scan the data after generating the plan is thus very
similar to exec_simple_query() (i.e. create a portal and run it
based on the planned statement), except that worker backends
initialize the block range they want to scan in the executor
initialization phase (ExecInitSeqScan()).
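For reference, the worker-side flow is roughly the following,
modeled on exec_simple_query() (heavily simplified, with error
handling omitted; planned_stmt is the statement the worker rebuilt
from the shared information, and receiver is the DestReceiver that
writes into the worker's message queue):

portal = CreatePortal("parallel_worker", true, true);
PortalDefineQuery(portal, NULL, query_string, "SELECT",
                  list_make1(planned_stmt), NULL);
PortalStart(portal, NULL, 0, InvalidSnapshot);
(void) PortalRun(portal, FETCH_ALL, true, receiver, receiver, NULL);
PortalDrop(portal, false);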
Workers exit after sending the data to the master backend, which
essentially means that for each execution we need to initiate the
workers afresh. I think we can improve this by handing control of
the workers to the postmaster so that we don't need to initialize
them on each execution; however, that can be a totally separate
optimization, better done independently of this patch.
As we currently don't have a mechanism to share transaction state,
I have used a separate transaction in the worker backends to
execute the plan.
Any error in the master backend, whether raised via a worker
backend or due to some other issue in the master backend itself,
should terminate all the workers before the transaction aborts.
We can't do this with the error context callback mechanism
(error_context_stack) which we use at other places in the code,
because in this case we need the callback from the time the workers
are started until the execution is complete (error_context_stack
can get reset once control leaves the function that set it).
One way would be to maintain the callback information in
TransactionState and use it to kill the workers before aborting the
transaction in the main backend. Another would be to have a
variable similar to error_context_stack (used specifically for
storing the workers' state) and kill the workers in errfinish via a
callback. Currently I have handled it at the time of detaching from
shared memory.
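A sketch of that approach (on_dsm_detach() and
TerminateBackgroundWorker() are the existing APIs; ParallelScanState
is a hypothetical structure holding the worker handles):

/*
 * Terminate any still-running workers when the master detaches from
 * the dynamic shared memory segment; detach happens during resource
 * cleanup, so this runs on both normal completion and error abort.
 */
static void
terminate_workers_on_detach(dsm_segment *seg, Datum arg)
{
    ParallelScanState *pss = (ParallelScanState *) DatumGetPointer(arg);
    int i;

    for (i = 0; i < pss->num_workers; i++)
        TerminateBackgroundWorker(pss->worker_handles[i]);
}

/* registered once, right after the workers have been launched: */
on_dsm_detach(seg, terminate_workers_on_detach, PointerGetDatum(pss));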
Another point that needs care in the worker backend is that if any
error occurs there, we should *not* abort the transaction, as the
transaction state is shared across all workers.
Currently, a parallel seq scan will not be considered for
statements other than SELECT, or if there is a join in the
statement, or if the statement contains quals, or if the target
list contains non-Var fields. We can definitely support simple
quals and target lists containing non-Vars. By simple, I mean that
a qual should not contain functions or other constructs that can't
be pushed down to a worker backend.
The behaviour of some simple statements with the patch is shown below:
postgres=# create table t1(c1 int, c2 char(500)) with (fillfactor=10);
CREATE TABLE
postgres=# insert into t1 values(generate_series(1,100),'amit');
INSERT 0 100
postgres=# explain select c1 from t1;
QUERY PLAN
------------------------------------------------------
Seq Scan on t1 (cost=0.00..101.00 rows=100 width=4)
(1 row)
postgres=# set parallel_seqscan_degree=4;
SET
postgres=# explain select c1 from t1;
QUERY PLAN
--------------------------------------------------------------
Parallel Seq Scan on t1 (cost=0.00..25.25 rows=100 width=4)
Number of Workers: 4
Number of Blocks Per Workers: 25
(3 rows)
postgres=# explain select Distinct(c1) from t1;
QUERY PLAN
--------------------------------------------------------------------
HashAggregate (cost=25.50..26.50 rows=100 width=4)
Group Key: c1
-> Parallel Seq Scan on t1 (cost=0.00..25.25 rows=100 width=4)
Number of Workers: 4
Number of Blocks Per Workers: 25
(5 rows)
The attached patch is just to facilitate discussion about parallel
seq scan and some other dependent tasks, like sharing various
states (combocid, snapshot) with parallel workers. It is by no
means ready for any complex test; of course, I will work towards
making it more robust, both in terms of adding more stuff and doing
performance optimizations.
Thoughts/Suggestions?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
parallel_seqscan_v1.patch (application/octet-stream)
On 12/04/2014 07:35 AM, Amit Kapila wrote:
[snip]
The number of worker backends that can be used for
parallel seq scan can be configured by using a new GUC
parallel_seqscan_degree, the default value of which is zero
and it means parallel seq scan will not be considered unless
user configures this value.
The number of parallel workers should be capped (of course!) at the
maximum number of "processors" (cores/vCores, threads/hyperthreads)
available.
Moreover, when load goes up, the relative cost of working in parallel
should go up as well.
Something like:
p = number of cores
l = 1-minute load
additional_cost = tuple_estimate * cpu_tuple_cost * (l+1)/(c-1)
(for c > 1, of course)
In ExecutorStart phase, initiate the required number of workers
as per parallel seq scan plan and setup dynamic shared memory and
share the information required for worker to execute the scan.
Currently I have just shared the relId, targetlist and number
of blocks to be scanned by worker, however I think we might want
to generate a plan for each of the workers in master backend and
then share the same to individual worker.
[snip]
Attached patch is just to facilitate the discussion about the
parallel seq scan and may be some other dependent tasks like
sharing of various states like combocid, snapshot with parallel
workers. It is by no means ready to do any complex test, ofcourse
I will work towards making it more robust both in terms of adding
more stuff and doing performance optimizations.
Thoughts/Suggestions?
Not directly (I haven't had the time to read the code yet), but I'm
thinking about the ability to simply *replace* executor methods from an
extension.
This could be an alternative to providing additional nodes that the
planner can include in the final plan tree, ready to be executed.
The parallel seq scan nodes are definitely the best approach for
"parallel query", since the planner can optimize them based on cost.
I'm wondering about the ability to modify the implementation of some
methods themselves once at execution time: given a previously planned
query, chances are that, at execution time (I'm specifically thinking
about prepared statements here), a different implementation of the same
"node" might be more suitable and could be used instead while the
condition holds.
If this latter line of thinking is too off-topic within this thread
and there is any interest, we can move the comments to another
thread and I'd begin work on a PoC patch. It might as well make
sense to implement the executor overloading mechanism alongside the
custom plan API, though.
Any comments appreciated.
Thank you for your work, Amit
Regards,
/ J.L.
José,
* José Luis Tallón (jltallon@adv-solutions.net) wrote:
On 12/04/2014 07:35 AM, Amit Kapila wrote:
The number of worker backends that can be used for
parallel seq scan can be configured by using a new GUC
parallel_seqscan_degree, the default value of which is zero
and it means parallel seq scan will not be considered unless
user configures this value.
The number of parallel workers should be capped (of course!) at the
maximum number of "processors" (cores/vCores, threads/hyperthreads)
available.
Moreover, when load goes up, the relative cost of working in parallel
should go up as well.
Something like:
p = number of cores
l = 1-minute load
additional_cost = tuple_estimate * cpu_tuple_cost * (l+1)/(c-1)
(for c > 1, of course)
While I agree in general that we'll need to come up with appropriate
acceptance criteria, etc, I don't think we want to complicate this patch
with that initially. A SUSET GUC which caps the parallel GUC would be
enough for an initial implementation, imv.
Not directly (I haven't had the time to read the code yet), but I'm
thinking about the ability to simply *replace* executor methods from
an extension.
You probably want to look at the CustomScan thread+patch directly then..
Thanks,
Stephen
Amit,
* Amit Kapila (amit.kapila16@gmail.com) wrote:
postgres=# explain select c1 from t1;
QUERY PLAN
------------------------------------------------------
Seq Scan on t1 (cost=0.00..101.00 rows=100 width=4)
(1 row)
postgres=# set parallel_seqscan_degree=4;
SET
postgres=# explain select c1 from t1;
QUERY PLAN
--------------------------------------------------------------
Parallel Seq Scan on t1 (cost=0.00..25.25 rows=100 width=4)
Number of Workers: 4
Number of Blocks Per Workers: 25
(3 rows)
This is all great and interesting, but I feel like folks might be
waiting to see just what kind of performance results come from this (and
what kind of hardware is needed to see gains..). There are likely
to be situations where this change is an improvement, as well as
cases where it makes things worse.
One really interesting case would be parallel seq scans which are
executing against foreign tables/FDWs..
Thanks!
Stephen
On 12/5/14, 9:08 AM, José Luis Tallón wrote:
Moreover, when load goes up, the relative cost of working in parallel
should go up as well.
Something like:
p = number of cores
l = 1-minute load
additional_cost = tuple_estimate * cpu_tuple_cost * (l+1)/(c-1)
(for c > 1, of course)
...
The parallel seq scan nodes are definitely the best approach for "parallel query", since the planner can optimize them based on cost.
I'm wondering about the ability to modify the implementation of some methods themselves once at execution time: given a previously planned query, chances are that, at execution time (I'm specifically thinking about prepared statements here), a different implementation of the same "node" might be more suitable and could be used instead while the condition holds.
These comments got me wondering... would it be better to decide on parallelism during execution instead of at plan time? That would allow us to dynamically scale parallelism based on system load. If we don't even consider parallelism until we've pulled some number of tuples/pages from a relation, this would also eliminate all parallel overhead on small relations.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Fri, Dec 5, 2014 at 8:38 PM, José Luis Tallón <jltallon@adv-solutions.net>
wrote:
On 12/04/2014 07:35 AM, Amit Kapila wrote:
[snip]
The number of worker backends that can be used for
parallel seq scan can be configured by using a new GUC
parallel_seqscan_degree, the default value of which is zero
and it means parallel seq scan will not be considered unless
user configures this value.
The number of parallel workers should be capped (of course!) at the
maximum number of "processors" (cores/vCores, threads/hyperthreads)
available.
Also, it should take into account the MaxConnections configured by the user.
Moreover, when load goes up, the relative cost of working in parallel
should go up as well.
Something like:
p = number of cores
l = 1-minute load
additional_cost = tuple_estimate * cpu_tuple_cost * (l+1)/(c-1)
(for c > 1, of course)
How will you identify load in the above formula, and what exactly
is 'c' (is it the number of parallel workers involved)?
For now, I have managed this simply by having a configuration
variable, and it seems to me that should be good enough for a first
version. We can definitely enhance it in a future version by
dynamically allocating the number of workers based on their
availability and the needs of the query, but I think let's leave
that for another day.
In ExecutorStart phase, initiate the required number of workers
as per parallel seq scan plan and setup dynamic shared memory and
share the information required for worker to execute the scan.
Currently I have just shared the relId, targetlist and number
of blocks to be scanned by worker, however I think we might want
to generate a plan for each of the workers in master backend and
then share the same to individual worker.
[snip]
Attached patch is just to facilitate the discussion about the
parallel seq scan and may be some other dependent tasks like
sharing of various states like combocid, snapshot with parallel
workers. It is by no means ready to do any complex test, ofcourse
I will work towards making it more robust both in terms of adding
more stuff and doing performance optimizations.
Thoughts/Suggestions?
Not directly (I haven't had the time to read the code yet), but I'm
thinking about the ability to simply *replace* executor methods from an
extension.
This could be an alternative to providing additional nodes that the
planner can include in the final plan tree, ready to be executed.
The parallel seq scan nodes are definitely the best approach for
"parallel query", since the planner can optimize them based on cost.
I'm wondering about the ability to modify the implementation of some
methods themselves once at execution time: given a previously planned
query, chances are that, at execution time (I'm specifically thinking about
prepared statements here), a different implementation of the same "node"
might be more suitable and could be used instead while the condition holds.
The idea sounds interesting, and I think in some cases a different
implementation of the same node might indeed help. But at this
stage, if we focus on one kind of implementation (one which is a
win for a reasonable number of cases) and make it successful, then
doing alternative implementations will be comparatively easier and
have a better chance of success.
If this latter line of thinking is too off-topic within this thread and
there is any interest, we can move the comments to another thread and I'd
begin work on a PoC patch. It might as well make sense to implement the
executor overloading mechanism alongside the custom plan API, though.
Sure, please go ahead whichever way you would like to proceed.
If you want to contribute in this area/patch, then you are
welcome.
Any comments appreciated.
Thank you for your work, Amit
Many thanks to you as well for showing interest.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 4 December 2014 at 19:35, Amit Kapila <amit.kapila16@gmail.com> wrote:
Attached patch is just to facilitate the discussion about the
parallel seq scan and may be some other dependent tasks like
sharing of various states like combocid, snapshot with parallel
workers. It is by no means ready to do any complex test, ofcourse
I will work towards making it more robust both in terms of adding
more stuff and doing performance optimizations.
Thoughts/Suggestions?
This is good news!
I've not gotten to look at the patch yet, but I thought you may be able to
make use of the attached at some point.
It's bare-bones core support for allowing aggregate states to be merged
together with another aggregate state. I would imagine that if a query such
as:
SELECT MAX(value) FROM bigtable;
was run, then a series of parallel workers could go off and each find the
max value from their portion of the table and then perhaps some other node
type would then take all the intermediate results from the workers, once
they're finished, and join all of the aggregate states into one and return
that. Naturally, you'd need to check that all aggregates used in the
targetlist had a merge function first.
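To make the merge step concrete, here is a toy illustration for
MAX() in plain C (not the patch's API): each worker returns the max
of its portion, and merging the states is just another max over the
partial results.

static int
merge_max_states(const int *partial_max, int nworkers)
{
    int result = partial_max[0];
    int i;

    /* merging MAX states is simply MAX over the partial results */
    for (i = 1; i < nworkers; i++)
        if (partial_max[i] > result)
            result = partial_max[i];
    return result;
}

For AVG() or COUNT(), the state would instead be (sum, count) pairs
that are added together, which is why those need new functions.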
This is just a few hours of work. I've not really tested the pg_dump
support or anything yet. I've also not added any new functions to allow
AVG() or COUNT() to work, I've really just re-used existing functions where
I could, as things like MAX() and BOOL_OR() can just make use of the
existing transition function. I thought that this might be enough for early
tests.
I'd imagine such a workload, ignoring IO overhead, should scale pretty much
linearly with the number of worker processes. Of course, if there was a
GROUP BY clause then the merger code would have to perform more work.
If you think you might be able to make use of this, then I'm willing to go
off and write all the other merge functions required for the other
aggregates.
Regards
David Rowley
Attachments:
merge_aggregate_state_v1.patch (application/octet-stream)
On Fri, Dec 5, 2014 at 8:46 PM, Stephen Frost <sfrost@snowman.net> wrote:
Amit,
* Amit Kapila (amit.kapila16@gmail.com) wrote:
postgres=# explain select c1 from t1;
QUERY PLAN
------------------------------------------------------
Seq Scan on t1 (cost=0.00..101.00 rows=100 width=4)
(1 row)
postgres=# set parallel_seqscan_degree=4;
SET
postgres=# explain select c1 from t1;
QUERY PLAN
--------------------------------------------------------------
Parallel Seq Scan on t1 (cost=0.00..25.25 rows=100 width=4)
Number of Workers: 4
Number of Blocks Per Workers: 25
(3 rows)
This is all great and interesting, but I feel like folks might be
waiting to see just what kind of performance results come from this (and
what kind of hardware is needed to see gains..).
Initially I was thinking that first we should discuss whether the
design and idea used in the patch are sane, but now that you have
asked (and Robert has asked the same off-list), I will take the
performance data next week. (Another reason why I have not taken
any data is that the work to push qualification down to the workers
is still left, which I feel is quite important.) However, I still
think it would be good to get some feedback on some of the basic
things below.
1. As the patch currently stands, it just shares the relevant
data (like relid, target list, block range each worker should
work on, etc.) with the worker; the worker then receives that
data, forms the planned statement, executes it and sends the
results back to the master backend. So the question here is: do
you think that is reasonable, or should we try to form the complete
plan for each worker and then share that, and maybe other
information as well, like the range table entries that are
required? My personal gut feeling in this matter is that for the
long term it might be better to form the complete plan of each
worker in the master and share that; however, I think the current
way as done in the patch (okay, that needs some improvement) is
also not bad and quite a bit easier to implement.
2. The next question, related to the above, is what should be the
output of Explain Plan. As currently each worker is responsible
for forming its own plan, Explain Plan is not able to show the
detailed plan for each worker; is that okay?
3. Some places where optimizations are possible:
- Currently, after getting a tuple from the heap, it is deformed by
the worker and sent via the message queue to the master backend;
the master backend then forms the tuple and sends it to the upper
layer, which deforms it again via slot_getallattrs(slot) before
sending it to the frontend.
- The master backend currently receives the data from multiple
workers serially. We can optimize this so that it checks the other
queues if there is no data in the current queue.
- The master backend is just responsible for coordination among
workers: it shares the required information with the workers and
then fetches the data processed by each worker. With some more
logic, we might be able to make the master backend also fetch data
from the heap rather than doing just coordination among workers.
I think in all the above places we can do some optimization;
however, we can do that later as well, unless they hurt performance
badly for the cases people care about most.
4. Should the parallel_seqscan_degree value be dependent on other
backend process counts like MaxConnections, max_worker_processes
and autovacuum_max_workers, or should it be independent like
max_wal_senders?
I think it is better to keep it dependent on other backend process
counts; however, for simplicity, I have kept it similar to
max_wal_senders for now.
There are likely to be
situations where this change is an improvement, as well as cases
where it makes things worse.
Agreed, and I think that will become clearer after doing some
performance tests.
One really interesting case would be parallel seq scans which are
executing against foreign tables/FDWs..
Sure.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Dec 5, 2014 at 8:43 PM, Stephen Frost <sfrost@snowman.net> wrote:
José,
* José Luis Tallón (jltallon@adv-solutions.net) wrote:
On 12/04/2014 07:35 AM, Amit Kapila wrote:
The number of worker backends that can be used for
parallel seq scan can be configured by using a new GUC
parallel_seqscan_degree, the default value of which is zero
and it means parallel seq scan will not be considered unless
user configures this value.
The number of parallel workers should be capped (of course!) at the
maximum number of "processors" (cores/vCores, threads/hyperthreads)
available.
Moreover, when load goes up, the relative cost of working in parallel
should go up as well.
Something like:
p = number of cores
l = 1-minute load
additional_cost = tuple_estimate * cpu_tuple_cost * (l+1)/(c-1)
(for c > 1, of course)
While I agree in general that we'll need to come up with appropriate
acceptance criteria, etc, I don't think we want to complicate this patch
with that initially.
A SUSET GUC which caps the parallel GUC would be
enough for an initial implementation, imv.
This is exactly what I have done in the patch.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Dec 6, 2014 at 12:27 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 12/5/14, 9:08 AM, José Luis Tallón wrote:
Moreover, when load goes up, the relative cost of working in parallel
should go up as well.
Something like:
p = number of cores
l = 1-minute load
additional_cost = tuple_estimate * cpu_tuple_cost * (l+1)/(c-1)
(for c > 1, of course)
...
The parallel seq scan nodes are definitely the best approach for
"parallel query", since the planner can optimize them based on cost.
I'm wondering about the ability to modify the implementation of some
methods themselves once at execution time: given a previously planned
query, chances are that, at execution time (I'm specifically thinking about
prepared statements here), a different implementation of the same "node"
might be more suitable and could be used instead while the condition holds.
These comments got me wondering... would it be better to decide on
parallelism during execution instead of at plan time? That would allow us
to dynamically scale parallelism based on system load. If we don't even
consider parallelism until we've pulled some number of tuples/pages
from a relation, this would also eliminate all parallel overhead on
small relations.
I think we have access to this information in the planner
(RelOptInfo->pages); if we want, we can use that to exclude small
relations from parallelism. But the question is how big a relation
we want to consider for parallelism. One way is to check via tests,
which I am planning to do; do you think we have any heuristic we
can use to decide how big a relation should be to be considered for
parallelism?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Dec 6, 2014 at 10:43 AM, David Rowley <dgrowleyml@gmail.com> wrote:
On 4 December 2014 at 19:35, Amit Kapila <amit.kapila16@gmail.com> wrote:
Attached patch is just to facilitate the discussion about the
parallel seq scan and may be some other dependent tasks like
sharing of various states like combocid, snapshot with parallel
workers. It is by no means ready to do any complex test, ofcourse
I will work towards making it more robust both in terms of adding
more stuff and doing performance optimizations.
Thoughts/Suggestions?
This is good news!
Thanks.
I've not gotten to look at the patch yet, but I thought you may be able
to make use of the attached at some point.
I also think so; it can be used in the near future to enhance and
provide more value to the parallel scan feature. Thanks for taking
the initiative to do the leg-work for supporting aggregates.
It's bare-bones core support for allowing aggregate states to be merged
together with another aggregate state. I would imagine that if a query such
as:
SELECT MAX(value) FROM bigtable;
was run, then a series of parallel workers could go off and each find the
max value from their portion of the table and then perhaps some other node
type would then take all the intermediate results from the workers, once
they're finished, and join all of the aggregate states into one and return
that. Naturally, you'd need to check that all aggregates used in the
targetlist had a merge function first.
The direction sounds right.
This is just a few hours of work. I've not really tested the pg_dump
support or anything yet. I've also not added any new functions to allow
AVG() or COUNT() to work, I've really just re-used existing functions where
I could, as things like MAX() and BOOL_OR() can just make use of the
existing transition function. I thought that this might be enough for early
tests.
I'd imagine such a workload, ignoring IO overhead, should scale pretty
much linearly with the number of worker processes. Of course, if there was
a GROUP BY clause then the merger code would have to perform more work.
Agreed.
If you think you might be able to make use of this, then I'm willing to
go off and write all the other merge functions required for the other
aggregates.
Don't you think we should first stabilize the basic parallel scan
(target list and quals that can be independently evaluated by
workers) and then jump to such enhancements?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
* Amit Kapila (amit.kapila16@gmail.com) wrote:
1. As the patch currently stands, it just shares the relevant
data (like relid, target list, block range each worker should
perform on etc.) to the worker and then worker receives that
data and form the planned statement which it will execute and
send the results back to master backend. So the question
here is do you think it is reasonable or should we try to form
the complete plan for each worker and then share the same
and may be other information as well like range table entries
which are required. My personal gut feeling in this matter
is that for long term it might be better to form the complete
plan of each worker in master and share the same, however
I think the current way as done in patch (okay that needs
some improvement) is also not bad and quite easier to implement.
For my 2c, I'd like to see it support exactly what the SeqScan node
supports and then also what Foreign Scan supports. That would mean we'd
then be able to push filtering down to the workers which would be great.
Even better would be figuring out how to parallelize an Append node
(perhaps only possible when the nodes underneath are all SeqScan or
ForeignScan nodes) since that would allow us to then parallelize the
work across multiple tables and remote servers.
One of the big reasons why I was asking about performance data is that,
today, we can't easily split a single relation across multiple i/o
channels. Sure, we can use RAID and get the i/o channel that the table
sits on faster than a single disk and possibly fast enough that a single
CPU can't keep up, but that's not quite the same. The historical
recommendation for Hadoop nodes is around one CPU per drive (of course,
it'll depend on workload, etc, etc, but still) and while there's still a
lot of testing, etc, to be done before we can be sure about the 'right'
answer for PG (and it'll also vary based on workload, etc), that strikes
me as a pretty reasonable rule-of-thumb to go on.
Of course, I'm aware that this won't be as easy to implement..
2. Next question related to above is what should be the
output of ExplainPlan, as currently worker is responsible
for forming its own plan, Explain Plan is not able to show
the detailed plan for each worker, is that okay?
I'm not entirely following this. How can the worker be responsible for
its own "plan" when the information passed to it (per the above
paragraph..) is pretty minimal? In general, I don't think we need to
have specifics like "this worker is going to do exactly X" because we
will eventually need some communication to happen between the worker and
the master process where the worker can ask for more work because it's
finished what it was tasked with and the master will need to give it
another chunk of work to do. I don't think we want exactly what each
worker process will do to be fully formed at the outset because, even
with the best information available, given concurrent load on the
system, it's not going to be perfect and we'll end up starving workers.
The plan, as formed by the master, should be more along the lines of
"this is what I'm gonna have my workers do" along w/ how many workers,
etc, and then it goes and does it. Perhaps for an 'explain analyze' we
return information about what workers actually *did* what, but that's a
whole different discussion.
3. Some places where optimizations are possible:
- Currently after getting the tuple from heap, it is deformed by
worker and sent via message queue to master backend, master
backend then forms the tuple and send it to upper layer which
before sending it to frontend again deforms it via slot_getallattrs(slot).
If this is done as I was proposing above, we might be able to avoid
this, but I don't know that it's a huge issue either way.. The bigger
issue is getting the filtering pushed down.
- Master backend currently receives the data from multiple workers
serially. We can optimize in a way that it can check other queues,
if there is no data in current queue.
Yes, this is pretty critical. In fact, it's one of the recommendations
I made previously about how to change the Append node to parallelize
Foreign Scan node work.
- Master backend is just responsible for coordination among workers
It shares the required information to workers and then fetch the
data processed by each worker, by using some more logic, we might
be able to make master backend also fetch data from heap rather than
doing just co-ordination among workers.
I don't think this is really necessary...
I think in all above places we can do some optimisation, however
we can do that later as well, unless they hit the performance badly for
cases which people care most.
I agree that we can improve the performance through various
optimizations later, but it's important to get the general structure and
design right or we'll end up having to reimplement a lot of it.
4. Should parallel_seqscan_degree value be dependent on other
backend processes like MaxConnections, max_worker_processes,
autovacuum_max_workers do or should it be independent like
max_wal_senders?
Well, we're not going to be able to spin off more workers than we have
process slots, but I'm not sure we need anything more than that? In any
case, this is definitely an area we can work on improving later and I
don't think it really impacts the rest of the design.
Thanks,
Stephen
On Sat, Dec 6, 2014 at 5:37 PM, Stephen Frost <sfrost@snowman.net> wrote:
* Amit Kapila (amit.kapila16@gmail.com) wrote:
1. As the patch currently stands, it just shares the relevant
data (like relid, target list, block range each worker should
perform on etc.) to the worker and then worker receives that
data and form the planned statement which it will execute and
send the results back to master backend. So the question
here is do you think it is reasonable or should we try to form
the complete plan for each worker and then share the same
and may be other information as well like range table entries
which are required. My personal gut feeling in this matter
is that for long term it might be better to form the complete
plan of each worker in master and share the same, however
I think the current way as done in patch (okay that needs
some improvement) is also not bad and quite easier to implement.
For my 2c, I'd like to see it support exactly what the SeqScan node
supports and then also what Foreign Scan supports. That would mean we'd
then be able to push filtering down to the workers which would be great.
Even better would be figuring out how to parallelize an Append node
(perhaps only possible when the nodes underneath are all SeqScan or
ForeignScan nodes) since that would allow us to then parallelize the
work across multiple tables and remote servers.
One of the big reasons why I was asking about performance data is that,
today, we can't easily split a single relation across multiple i/o
channels. Sure, we can use RAID and get the i/o channel that the table
sits on faster than a single disk and possibly fast enough that a single
CPU can't keep up, but that's not quite the same. The historical
recommendations for Hadoop nodes is around one CPU per drive (of course,
it'll depend on workload, etc, etc, but still) and while there's still a
lot of testing, etc, to be done before we can be sure about the 'right'
answer for PG (and it'll also vary based on workload, etc), that strikes
me as a pretty reasonable rule-of-thumb to go on.
Of course, I'm aware that this won't be as easy to implement..
2. Next question related to above is what should be the
output of ExplainPlan, as currently worker is responsible
for forming its own plan, Explain Plan is not able to show
the detailed plan for each worker, is that okay?
I'm not entirely following this. How can the worker be responsible for
its own "plan" when the information passed to it (per the above
paragraph..) is pretty minimal?
Because for a simple sequential scan that much information is
sufficient: basically, if we have the scanrelid, target list, qual
and the RTE (primarily the relOid), then the worker can form and
perform the sequential scan.
In general, I don't think we need to
have specifics like "this worker is going to do exactly X" because we
will eventually need some communication to happen between the worker and
the master process where the worker can ask for more work because it's
finished what it was tasked with and the master will need to give it
another chunk of work to do. I don't think we want exactly what each
worker process will do to be fully formed at the outset because, even
with the best information available, given concurrent load on the
system, it's not going to be perfect and we'll end up starving workers.
The plan, as formed by the master, should be more along the lines of
"this is what I'm gonna have my workers do" along w/ how many workers,
etc, and then it goes and does it.
I think here you are saying that work allocation for the workers
should be dynamic rather than fixed, which I think makes sense;
however, we can try such an optimization after gathering some
initial performance data.
Perhaps for an 'explain analyze' we
return information about what workers actually *did* what, but that's a
whole different discussion.
Agreed.
3. Some places where optimizations are possible:
- Currently after getting the tuple from heap, it is deformed by
worker and sent via message queue to master backend, master
backend then forms the tuple and send it to upper layer which
before sending it to frontend again deforms it via
slot_getallattrs(slot).
If this is done as I was proposing above, we might be able to avoid
this, but I don't know that it's a huge issue either way.. The bigger
issue is getting the filtering pushed down.
- Master backend currently receives the data from multiple workers
serially. We can optimize in a way that it can check other queues,
if there is no data in current queue.
Yes, this is pretty critical. In fact, it's one of the recommendations
I made previously about how to change the Append node to parallelize
Foreign Scan node work.
- Master backend is just responsible for coordination among workers
It shares the required information to workers and then fetch the
data processed by each worker, by using some more logic, we might
be able to make master backend also fetch data from heap rather than
doing just co-ordination among workers.
I don't think this is really necessary...
I think in all above places we can do some optimisation, however
we can do that later as well, unless they hit the performance badly for
cases which people care most.
I agree that we can improve the performance through various
optimizations later, but it's important to get the general structure and
design right or we'll end up having to reimplement a lot of it.
So to summarize my understanding, below is the set of things
I should work on, in the order they are listed.
1. Push down qualification
2. Performance Data
3. Improve the way to push down the information related to worker.
4. Dynamic allocation of work for workers.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Dec 6, 2014 at 12:13 AM, David Rowley <dgrowleyml@gmail.com> wrote:
It's bare-bones core support for allowing aggregate states to be merged
together with another aggregate state. I would imagine that if a query such
as:
SELECT MAX(value) FROM bigtable;
was run, then a series of parallel workers could go off and each find the
max value from their portion of the table and then perhaps some other node
type would then take all the intermediate results from the workers, once
they're finished, and join all of the aggregate states into one and return
that. Naturally, you'd need to check that all aggregates used in the
targetlist had a merge function first.
I think this is great infrastructure and could also be useful for
pushing down aggregates in cases involving foreign data wrappers. But
I suggest we discuss it on a separate thread because it's not related
to parallel seq scan per se.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Dec 6, 2014 at 1:50 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I think we have access to this information in planner (RelOptInfo -> pages),
if we want, we can use that to eliminate the small relations from
parallelism, but question is how big relations do we want to consider
for parallelism, one way is to check via tests which I am planning to
follow, do you think we have any heuristic which we can use to decide
how big relations should be considered for parallelism?
Surely the Path machinery needs to decide this in particular cases
based on cost. We should assign some cost to starting a parallel
worker via some new GUC, like parallel_startup_cost = 100,000. And
then we should also assign a cost to the act of relaying a tuple from
the parallel worker to the master, maybe cpu_tuple_cost (or some new
GUC). For a small relation, or a query with a LIMIT clause, the
parallel startup cost will make starting a lot of workers look
unattractive, but for bigger relations it will make sense from a cost
perspective, which is exactly what we want.
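In rough terms, the cost shape being proposed here would be
(parallel_startup_cost is the proposed GUC; the per-tuple relay
cost would be cpu_tuple_cost or a new GUC):

/* startup is charged per worker; relaying is charged per tuple */
total_cost = (seq_scan_cost / num_workers)
           + parallel_startup_cost * num_workers
           + per_tuple_relay_cost * tuples_returned;

For a small relation the startup term dominates and the serial plan
wins; for a big relation the divided scan cost dominates, which is
exactly the behavior we want.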
There are probably other important considerations based on goals for
overall resource utilization, and also because at a certain point
adding more workers won't help because the disk will be saturated. I
don't know exactly what we should do about those issues yet, but the
steps described in the previous paragraph seem like a good place to
start anyway.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Dec 6, 2014 at 7:07 AM, Stephen Frost <sfrost@snowman.net> wrote:
For my 2c, I'd like to see it support exactly what the SeqScan node
supports and then also what Foreign Scan supports. That would mean we'd
then be able to push filtering down to the workers which would be great.
Even better would be figuring out how to parallelize an Append node
(perhaps only possible when the nodes underneath are all SeqScan or
ForeignScan nodes) since that would allow us to then parallelize the
work across multiple tables and remote servers.
I don't see how we can support the stuff ForeignScan does; presumably
any parallelism there is up to the FDW to implement, using whatever
in-core tools we provide. I do agree that parallelizing Append nodes
is useful; but let's get one thing done first before we start trying
to do thing #2.
I'm not entirely following this. How can the worker be responsible for
its own "plan" when the information passed to it (per the above
paragraph..) is pretty minimal? In general, I don't think we need to
have specifics like "this worker is going to do exactly X" because we
will eventually need some communication to happen between the worker and
the master process where the worker can ask for more work because it's
finished what it was tasked with and the master will need to give it
another chunk of work to do. I don't think we want exactly what each
worker process will do to be fully formed at the outset because, even
with the best information available, given concurrent load on the
system, it's not going to be perfect and we'll end up starving workers.
The plan, as formed by the master, should be more along the lines of
"this is what I'm gonna have my workers do" along w/ how many workers,
etc, and then it goes and does it. Perhaps for an 'explain analyze' we
return information about what workers actually *did* what, but that's a
whole different discussion.
I agree with this. For a first version, I think it's OK to start a
worker up for a particular sequential scan and have it help with that
sequential scan until the scan is completed, and then exit. It should
not, as the present version of the patch does, assign a fixed block
range to each worker; instead, workers should allocate a block or
chunk of blocks to work on until no blocks remain. That way, even if
every worker but one gets stuck, the rest of the scan can still
finish.
Eventually, we will want to be smarter about sharing workers between
multiple parts of the plan, but I think it is just fine to leave that
as a future enhancement for now.
- Master backend is just responsible for coordination among workers
It shares the required information to workers and then fetch the
data processed by each worker, by using some more logic, we might
be able to make master backend also fetch data from heap rather than
doing just co-ordination among workers.
I don't think this is really necessary...
I think it would be an awfully good idea to make this work. The
master thread may be significantly faster than any of the others
because it has no IPC costs. We don't want to leave our best resource
sitting on the bench.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Dec 8, 2014 at 11:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Dec 6, 2014 at 1:50 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
I think we have access to this information in planner (RelOptInfo ->
pages), if we want, we can use that to eliminate the small relations
from parallelism, but question is how big relations do we want to
consider for parallelism, one way is to check via tests which I am
planning to follow, do you think we have any heuristic which we can
use to decide how big relations should be considered for parallelism?
Surely the Path machinery needs to decide this in particular cases
based on cost. We should assign some cost to starting a parallel
worker via some new GUC, like parallel_startup_cost = 100,000. And
then we should also assign a cost to the act of relaying a tuple from
the parallel worker to the master, maybe cpu_tuple_cost (or some new
GUC). For a small relation, or a query with a LIMIT clause, the
parallel startup cost will make starting a lot of workers look
unattractive, but for bigger relations it will make sense from a cost
perspective, which is exactly what we want.
Sounds sensible. cpu_tuple_cost is already used for another
purpose, so I am not sure it is the right thing to overload that
parameter; how about cpu_tuple_communication_cost or
cpu_tuple_comm_cost?
There are probably other important considerations based on goals for
overall resource utilization, and also because at a certain point
adding more workers won't help because the disk will be saturated. I
don't know exactly what we should do about those issues yet, but the
steps described in the previous paragraph seem like a good place to
start anyway.
Agreed.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Dec 8, 2014 at 11:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Dec 6, 2014 at 7:07 AM, Stephen Frost <sfrost@snowman.net> wrote:
For my 2c, I'd like to see it support exactly what the SeqScan node
supports and then also what Foreign Scan supports. That would mean we'd
then be able to push filtering down to the workers which would be great.
Even better would be figuring out how to parallelize an Append node
(perhaps only possible when the nodes underneath are all SeqScan or
ForeignScan nodes) since that would allow us to then parallelize the
work across multiple tables and remote servers.
I don't see how we can support the stuff ForeignScan does; presumably
any parallelism there is up to the FDW to implement, using whatever
in-core tools we provide. I do agree that parallelizing Append nodes
is useful; but let's get one thing done first before we start trying
to do thing #2.I'm not entirely following this. How can the worker be responsible for
its own "plan" when the information passed to it (per the above
paragraph..) is pretty minimal? In general, I don't think we need to
have specifics like "this worker is going to do exactly X" because we
will eventually need some communication to happen between the worker and
the master process where the worker can ask for more work because it's
finished what it was tasked with and the master will need to give it
another chunk of work to do. I don't think we want exactly what each
worker process will do to be fully formed at the outset because, even
with the best information available, given concurrent load on the
system, it's not going to be perfect and we'll end up starving workers.
The plan, as formed by the master, should be more along the lines of
"this is what I'm gonna have my workers do" along w/ how many workers,
etc, and then it goes and does it. Perhaps for an 'explain analyze' we
return information about what workers actually *did* what, but that's a
whole different discussion.
I agree with this. For a first version, I think it's OK to start a
worker up for a particular sequential scan and have it help with that
sequential scan until the scan is completed, and then exit. It should
not, as the present version of the patch does, assign a fixed block
range to each worker; instead, workers should allocate a block or
chunk of blocks to work on until no blocks remain. That way, even if
every worker but one gets stuck, the rest of the scan can still
finish.
I will check on this point and see if it is feasible to do
something along those lines. Basically, currently at the Executor
initialization phase we set the scan limits, and then during the
Executor run phase we use heap_getnext to fetch the tuples
accordingly; doing it dynamically means that at the ExecutorRun
phase we need to reset the scan limit for which page/pages to scan.
I still have to check if there is any problem with such an idea.
Do you have any different idea in mind?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Dec 9, 2014 at 12:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I agree with this. For a first version, I think it's OK to start a
worker up for a particular sequential scan and have it help with that
sequential scan until the scan is completed, and then exit. It should
not, as the present version of the patch does, assign a fixed block
range to each worker; instead, workers should allocate a block or
chunk of blocks to work on until no blocks remain. That way, even if
every worker but one gets stuck, the rest of the scan can still
finish.
I will check on this point and see if it is feasible to do something on
those lines, basically currently at Executor initialization phase, we
set the scan limits and then during Executor Run phase use
heap_getnext to fetch the tuples accordingly, but doing it dynamically
means at ExecutorRun phase we need to reset the scan limit for
which page/pages to scan, still I have to check if there is any problem
with such an idea. Do you any different idea in mind?
Hmm. Well, it looks like there are basically two choices: you can
either (as you propose) deal with this above the level of the
heap_beginscan/heap_getnext API by scanning one or a few pages at a
time and then resetting the scan to a new starting page via
heap_setscanlimits; or alternatively, you can add a callback to
HeapScanDescData that, if non-NULL, will be invoked to get the next
block number to scan. I'm not entirely sure which is better.
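A sketch of the second option (the typedef, callback and structure
names here are hypothetical, not existing code):

typedef unsigned int BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* If non-NULL in HeapScanDescData, consulted for the next block to
 * scan instead of advancing serially. */
typedef BlockNumber (*NextBlockCallback) (void *cb_arg);

typedef struct SharedBlockAllocator
{
    unsigned int next_block;    /* atomically incremented by workers */
    unsigned int total_blocks;  /* blocks in the relation */
} SharedBlockAllocator;

static BlockNumber
parallel_next_block(void *cb_arg)
{
    SharedBlockAllocator *alloc = (SharedBlockAllocator *) cb_arg;
    unsigned int blkno = __sync_fetch_and_add(&alloc->next_block, 1);

    /* InvalidBlockNumber tells the scan machinery that no blocks remain */
    return (blkno < alloc->total_blocks) ? blkno : InvalidBlockNumber;
}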
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Dec 8, 2014 at 10:40 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
On Sat, Dec 6, 2014 at 5:37 PM, Stephen Frost <sfrost@snowman.net> wrote:
So to summarize my understanding, below are the set of things
which I should work on and in the order they are listed.
1. Push down qualification
2. Performance Data
3. Improve the way to push down the information related to worker.
4. Dynamic allocation of work for workers.
I have worked on the patch to accomplish the above-mentioned points
1, 2 and partly 3, and would like to share the progress with the
community.
If the statement contains quals that don't have volatile functions,
they will be pushed down and the parallel scan will be considered
for cost evaluation. I think eventually we might need some better
way to decide which kinds of functions are okay to be pushed down.
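For reference, the check is roughly the following
(contain_volatile_functions() is the existing planner helper;
parallel_ok is just a local flag here):

ListCell   *lc;
bool        parallel_ok = true;

foreach(lc, baserel->baserestrictinfo)
{
    RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);

    /* volatile functions cannot safely run in a worker backend */
    if (contain_volatile_functions((Node *) rinfo->clause))
    {
        parallel_ok = false;
        break;
    }
}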
I have also unified the way information is passed from the master
backend to the worker backends: each node that has to be passed is
converted to a string, and the workers later convert the string
back to a node. This has simplified the related code.
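That is, roughly (nodeToString() and stringToNode() are the
existing helpers):

/* master side: flatten the node tree into the shared memory segment */
char *tlist_str = nodeToString(targetlist);

/* worker side: rebuild the node tree from the shared string */
List *tlist = (List *) stringToNode(tlist_str);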
I have taken performance data for different selectivities and
complexities of qual expressions. I understand that there will be
other kinds of scenarios which we need to consider; however, I
think the current set of tests is a good place to start. Please
feel free to comment on the kinds of scenarios you want me to
check.
Performance Data
------------------------------
m/c details:
IBM POWER-8: 24 cores, 192 hardware threads
RAM = 492GB

non-default settings in postgresql.conf:
max_connections = 300
shared_buffers = 8GB
checkpoint_segments = 300
checkpoint_timeout = 30min
max_worker_processes = 100
create table tbl_perf(c1 int, c2 char(1000));
30 million rows
------------------------
insert into tbl_perf values(generate_series(1,10000000),'aaaaa');
insert into tbl_perf values(generate_series(10000000,30000000),'aaaaa');
Function used in quals
-----------------------------------
A simple function that performs some calculation and returns the
value passed in, so that it can be used in a qual condition.
create or replace function calc_factorial(a integer, fact_val integer)
returns integer
as $$
begin
perform (fact_val)!;
return a;
end;
$$ language plpgsql STABLE;
In the data below:
num_workers - number of parallel workers configured using
parallel_seqscan_degree; 0 means a plain sequential scan is
executed, and greater than 0 means a parallel sequential scan.
exec_time - execution time reported by the Explain Analyze
statement.
Tests having quals containing function evaluation in the qual
expressions:

Test-1
Query - Explain analyze select c1 from tbl_perf where
        c1 > calc_factorial(29700000,10) and c2 like '%aa%';
Selection criteria - 1% of rows will be selected

  num_workers   exec_time (ms)
            0           229534
            2           121741
            4            67051
            8            35607
           16            24743

Test-2
Query - Explain analyze select c1 from tbl_perf where
        c1 > calc_factorial(27000000,10) and c2 like '%aa%';
Selection criteria - 10% of rows will be selected

  num_workers   exec_time (ms)
            0           226671
            2           151587
            4            93648
            8            70540
           16            55466

Test-3
Query - Explain analyze select c1 from tbl_perf where
        c1 > calc_factorial(22500000,10) and c2 like '%aa%';
Selection criteria - 25% of rows will be selected

  num_workers   exec_time (ms)
            0           232673
            2           197609
            4           142686
            8           111664
           16            98097

Tests having quals containing simple expressions in the qual:

Test-4
Query - Explain analyze select c1 from tbl_perf where
        c1 > 29700000 and c2 like '%aa%';
Selection criteria - 1% of rows will be selected

  num_workers   exec_time (ms)
            0            15505
            2             9155
            4             6030
            8             4523
           16             4459
           32             8259
           64            13388

Test-5
Query - Explain analyze select c1 from tbl_perf where
        c1 > 28500000 and c2 like '%aa%';
Selection criteria - 5% of rows will be selected

  num_workers   exec_time (ms)
            0            18906
            2            13446
            4             8970
            8             7887
           16            10403

Test-6
Query - Explain analyze select c1 from tbl_perf where
        c1 > 27000000 and c2 like '%aa%';
Selection criteria - 10% of rows will be selected

  num_workers   exec_time (ms)
            0            16132
            2            23780
            4            20275
            8            11390
           16            11418
Conclusion
------------------
1. Parallel workers help a lot when there is an expensive
qualification to evaluate; the more expensive the qualification,
the better the results.
2. It works well for low-selectivity quals; as the selectivity
increases, the benefit tends to go down due to the additional tuple
communication cost between the workers and the master backend.
3. After a certain point, having more workers won't help and rather
has a negative impact; refer to Test-4.
I think, as discussed previously, we need to introduce 2 additional
cost variables (parallel_startup_cost, cpu_tuple_communication_cost)
to estimate the parallel seq scan cost, so that when tables are
small or selectivity is high, the cost of the parallel plan
increases.
Thoughts and feedback for the current state of patch is welcome.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com