parallelize queries containing subplans

Started by Amit Kapilaover 9 years ago30 messageshackers

amit.kapila16@gmail.com

over 9 years ago

Currently, queries that have references to SubPlans or
AlternativeSubPlans are considered parallel-restricted. I think we
can lift this restriction in many cases especially when SubPlans are
parallel-safe. To make this work, we need to propagate the
parallel-safety information from path node to plan node and the same
could be easily done while creating a plan. Another option could be
that instead of propagating parallel-safety information from path to
plan, we can find out from the plan if it is parallel-safe (doesn't
contain any parallel-aware node) by traversing whole plan tree, but I
think it is a waste of cycles. Once we have parallel-safety
information in the plan, we can use that for detection of
parallel-safe expressions in max_parallel_hazard_walker(). Finally,
we can pass all the subplans to workers during plan serialization in
ExecSerializePlan(). This will enable workers to execute subplans
that are referred in parallel part of the plan. Now, we might be able
to optimize it such that we pass only subplans that are referred in
parallel portion of plan, but I am not sure if it is worth the trouble
because it is one-time cost and much lesser than other things we do
(like creating
dsm, launching workers).

Attached patch implements the above idea. This will enable
parallelism for queries containing un-correlated subplans, an example
of which is as follows:

set parallel_tuple_cost=0;
set parallel_setup_cost=0;
set min_parallel_relation_size=50;

create table t1 (i int, j int, k int);
create table t2 (i int, j int, k int);

insert into t1 values (generate_series(1,10)*random(),
generate_series(5,50)*random(),
generate_series(8,80)*random());
insert into t2 values (generate_series(4,10)*random(),
generate_series(5,90)*random(),
generate_series(2,10)*random());

Plan without Patch
-----------------------------
postgres=# explain select * from t1 where t1.i not in (select t2.i
from t2 where t2.i in (1,2,3));
QUERY PLAN
---------------------------------------------------------------
Seq Scan on t1 (cost=110.84..411.72 rows=8395 width=12)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Seq Scan on t2 (cost=0.00..104.50 rows=2537 width=4)
Filter: (i = ANY ('{1,2,3}'::integer[]))
(5 rows)

Plan with Patch
------------------------
postgres=# explain select * from t1 where t1.i not in (select t2.i
from t2 where t2.i in (1,2,3));
QUERY PLAN
-------------------------------------------------------------------------
Gather (cost=110.84..325.30 rows=8395 width=12)
Workers Planned: 1
-> Parallel Seq Scan on t1 (cost=110.84..325.30 rows=4938 width=12)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Seq Scan on t2 (cost=0.00..104.50 rows=2537 width=4)
Filter: (i = ANY ('{1,2,3}'::integer[]))
(7 rows)

We have observed that Q-16 in TPC-H have been improved with the patch
and the analysis of same will be shared by my colleague Rafia.

Now, we can further extend this to parallelize queries containing
correlated subplans like below:

explain select * from t1 where t1.i in (select t2.i from t2 where t2.i=t1.i);
QUERY PLAN
-------------------------------------------------------------
Seq Scan on t1 (cost=0.00..831049.09 rows=8395 width=12)
Filter: (SubPlan 1)
SubPlan 1
-> Seq Scan on t2 (cost=0.00..97.73 rows=493 width=4)
Filter: (i = t1.i)
(5 rows)

In the above query, Filter on t2 (i = t1.i) generates Param node which
is a parallel-restricted node, so such queries won't be able to use
parallelism even with the patch. I think we can mark Params which
refer to same level as parallel-safe and I think we have this
information (node-> varlevelsup/ phlevelsup/ agglevelsup) available
when we replace correlation vars (SS_replace_correlation_vars). The
reason why it is not advisable to mark Params that don't refer to same
query level as parallel-safe is that can lead to plans like below:

Foo
-> Gather
-> Bar
SubPlan 1 (SubPlan refers to Foo)

Now, this problem is tricky because we need to pass all such Params
each time we invoke Gather. That also is doable with much more
effort, but I am not sure if it is worth because all of the use cases
I have seen in TPC-H (Q-2) or TPC-DS (Q-6) always uses SubPlans that
refer to same query level.

Yet, another useful enhancement in this area could be to consider both
parallel and non-parallel paths for subplans. As of now, we consider
the cheapest/best path and form subplan from it, but it is quite
possible that instead of choosing parallel path (in case it is
cheapest) at subplan level, the non-parallel path at subplan level
could be beneficial when upper plan can use parallelism. I think this
will be a separate project in itself if we want to do this and based
on my study of TPC-H and TPC-DS queries, I am confident that this will
be helpful in certain queries at higher scale factors.

Thoughts?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Rafia Sabih

rafia.sabih@enterprisedb.com

over 9 years ago

In reply to: Amit Kapila (#1)

Re: parallelize queries containing subplans

On Wed, Dec 28, 2016 at 11:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Currently, queries that have references to SubPlans or
AlternativeSubPlans are considered parallel-restricted. I think we
can lift this restriction in many cases especially when SubPlans are
parallel-safe. To make this work, we need to propagate the
parallel-safety information from path node to plan node and the same
could be easily done while creating a plan. Another option could be
that instead of propagating parallel-safety information from path to
plan, we can find out from the plan if it is parallel-safe (doesn't
contain any parallel-aware node) by traversing whole plan tree, but I
think it is a waste of cycles. Once we have parallel-safety
information in the plan, we can use that for detection of
parallel-safe expressions in max_parallel_hazard_walker(). Finally,
we can pass all the subplans to workers during plan serialization in
ExecSerializePlan(). This will enable workers to execute subplans
that are referred in parallel part of the plan. Now, we might be able
to optimize it such that we pass only subplans that are referred in
parallel portion of plan, but I am not sure if it is worth the trouble
because it is one-time cost and much lesser than other things we do
(like creating
dsm, launching workers).

Attached patch implements the above idea. This will enable
parallelism for queries containing un-correlated subplans, an example
of which is as follows:

set parallel_tuple_cost=0;
set parallel_setup_cost=0;
set min_parallel_relation_size=50;

create table t1 (i int, j int, k int);
create table t2 (i int, j int, k int);

insert into t1 values (generate_series(1,10)*random(),
generate_series(5,50)*random(),
generate_series(8,80)*random());
insert into t2 values (generate_series(4,10)*random(),
generate_series(5,90)*random(),
generate_series(2,10)*random());

Plan without Patch
-----------------------------
postgres=# explain select * from t1 where t1.i not in (select t2.i
from t2 where t2.i in (1,2,3));
QUERY PLAN
---------------------------------------------------------------
Seq Scan on t1 (cost=110.84..411.72 rows=8395 width=12)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Seq Scan on t2 (cost=0.00..104.50 rows=2537 width=4)
Filter: (i = ANY ('{1,2,3}'::integer[]))
(5 rows)

Plan with Patch
------------------------
postgres=# explain select * from t1 where t1.i not in (select t2.i
from t2 where t2.i in (1,2,3));
QUERY PLAN
-------------------------------------------------------------------------
Gather (cost=110.84..325.30 rows=8395 width=12)
Workers Planned: 1
-> Parallel Seq Scan on t1 (cost=110.84..325.30 rows=4938 width=12)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Seq Scan on t2 (cost=0.00..104.50 rows=2537 width=4)
Filter: (i = ANY ('{1,2,3}'::integer[]))
(7 rows)

We have observed that Q-16 in TPC-H have been improved with the patch
and the analysis of same will be shared by my colleague Rafia.

To study the effect of uncorrelated sub-plan pushdown on TPC-H and
TPC-DS benchmark queries we performed some experiments and the
execution time results for same are summarised as follows,

Query | HEAD | Patches | scale-factor
-----------+---------+-----------+-----------------
DS-Q45 | 35 | 19 | scale-factor: 100
H-Q16 | 812 | 645 | scale-factor: 300
H-Q16 | 49 | 37 | scale-factor: 20

Execution time given in above table is in seconds. Detailed analysis
of this experimentation is as follows,
Additional patches applied in this analysis are,
Parallel index scan [1]/messages/by-id/CAA4eK1KthrAvNjmB2cWuUHz+p3ZTTtbD7o2KUw49PC8HK4r1Wg@mail.gmail.com
Parallel index-only scan [2]/messages/by-id/CAOGQiiOv9NA1VrAo8PmENfGi-y=Cj_DiTF4vyjp+fmuEzovwEA@mail.gmail.com
Parallel merge-join [3]/messages/by-id/CAFiTN-tdYpcsk7Lpv0HapcmvSnMN_TgKjC7RkmvVLZAF+XfmPg@mail.gmail.com
The system setup used for this experiment is,

Server parameter settings:
work_mem = 500 MB,
max_parallel_workers_per_gather = 4,
random_page_cost = seq_page_cost = 0.1 = parallel_tuple_cost,
shared_buffers = 1 GB
Machine used: IBM Power, 4 socket machine, 512 GB RAM

TPC-DS scale factor = 100 (approx size of database is 150 GB)

Query 45 which takes around 35 seconds on head, completes in 19
seconds with these patches. The point to note here is that without
this patch of pushing uncorrelated sublans, hash join which is using
subplan in join filter could not be pushed to workers and hence query
was unable to use the parallelism enough, with this patch parallelism
is available till the topmost join node. The output of explain analyse
statement of this query on both head and with patches is given in
attached file ds_q45.txt.

On further evaluating these patches on TPC-H queries on different
scale factors we came across query 16, for TPC-H scale factor 20 and
aforementioned parameter settings with the change of
max_parallel_workers_per gather = 2, it took 37 seconds with the
patches and 49 seconds on head. Though the improvement in overall
query performance is not appearing to be significantly high, yet the
point to note here is that the time taken by join was around 26
seconds on head which reduced to 14 seconds with the patches. Overall
benefit in performance is not high because sort node is dominating the
execution time. The plan information of this query is given in
attached file h_q16_20_2.txt.

On increasing the scale factor to 300, setting work_mem to 1GB,
increasing max_parallel_workers_per_gather = 6, and disabling the
parallel sequential scan for supplier table by 'alter table supplier
set (parallel_workers = 0)', Q16 completes in 645 seconds with
patches, which was taking 812 seconds on head. We need to disable
parallel_workers for supplier table because on higher worker count it
was taking parallel seq scan and hence the scan node that is using
subplan could not be parallelised. For this query both pushdown of
subplans and parallel merge-join, without any one of these the
benefits of parallelism might not be leveraged fully. The output of
explain analyse for this query is given in h_q16_300.txt

Overall, with pushdown of uncorrelated sub-plans to workers enables
the parallelism in joins which was restricted before and hence good
improvement in query performance can be witnessed.

Now, we can further extend this to parallelize queries containing
correlated subplans like below:

explain select * from t1 where t1.i in (select t2.i from t2 where t2.i=t1.i);
QUERY PLAN
-------------------------------------------------------------
Seq Scan on t1 (cost=0.00..831049.09 rows=8395 width=12)
Filter: (SubPlan 1)
SubPlan 1
-> Seq Scan on t2 (cost=0.00..97.73 rows=493 width=4)
Filter: (i = t1.i)
(5 rows)

As per my analysis this extension to correlated subplans is likely to
improve the performance of following queries -- Q2 in TPC-H and Q6 in
TPC-DS.

Yet, another useful enhancement in this area could be to consider both
parallel and non-parallel paths for subplans. As of now, we consider
the cheapest/best path and form subplan from it, but it is quite
possible that instead of choosing parallel path (in case it is
cheapest) at subplan level, the non-parallel path at subplan level
could be beneficial when upper plan can use parallelism. I think this
will be a separate project in itself if we want to do this and based
on my study of TPC-H and TPC-DS queries, I am confident that this will
be helpful in certain queries at higher scale factors.

I agree as then we do not need to disable parallelism for particular
relations as we currently do for supplier relation in Q16 of TPC-H.

[1]: /messages/by-id/CAA4eK1KthrAvNjmB2cWuUHz+p3ZTTtbD7o2KUw49PC8HK4r1Wg@mail.gmail.com
[2]: /messages/by-id/CAOGQiiOv9NA1VrAo8PmENfGi-y=Cj_DiTF4vyjp+fmuEzovwEA@mail.gmail.com
[3]: /messages/by-id/CAFiTN-tdYpcsk7Lpv0HapcmvSnMN_TgKjC7RkmvVLZAF+XfmPg@mail.gmail.com

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

Amit Kapila

amit.kapila16@gmail.com

over 9 years ago

In reply to: Amit Kapila (#1)

Re: parallelize queries containing subplans

On Wed, Dec 28, 2016 at 11:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Now, we can further extend this to parallelize queries containing
correlated subplans like below:

explain select * from t1 where t1.i in (select t2.i from t2 where t2.i=t1.i);
QUERY PLAN
-------------------------------------------------------------
Seq Scan on t1 (cost=0.00..831049.09 rows=8395 width=12)
Filter: (SubPlan 1)
SubPlan 1
-> Seq Scan on t2 (cost=0.00..97.73 rows=493 width=4)
Filter: (i = t1.i)
(5 rows)

In the above query, Filter on t2 (i = t1.i) generates Param node which
is a parallel-restricted node, so such queries won't be able to use
parallelism even with the patch. I think we can mark Params which
refer to same level as parallel-safe and I think we have this
information (node-> varlevelsup/ phlevelsup/ agglevelsup) available
when we replace correlation vars (SS_replace_correlation_vars).

I have implemented the above idea which will allow same or immediate
outer level PARAMS as parallel_safe. The results of above query after
patch:

postgres=# explain select * from t1 where t1.i in (select t2.i from t2
where t2.i=t1.i);
QUERY PLAN
--------------------------------------------------------------------------
Gather (cost=0.00..488889.88 rows=8395 width=12)
Workers Planned: 1
-> Parallel Seq Scan on t1 (cost=0.00..488889.88 rows=4938 width=12)
Filter: (SubPlan 1)
SubPlan 1
-> Seq Scan on t2 (cost=0.00..97.73 rows=493 width=4)
Filter: (i = t1.i)
(7 rows)

Note - This patch can be applied on top of
pq_pushdown_subplan_v1.patch posted upthread [1]/messages/by-id/CAA4eK1+e8Z45D2n+rnDMDYsVEb5iW7jqaCH_tvPMYau=1Rru9w@mail.gmail.com

[1]: /messages/by-id/CAA4eK1+e8Z45D2n+rnDMDYsVEb5iW7jqaCH_tvPMYau=1Rru9w@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Dilip Kumar

dilipbalaut@gmail.com

over 9 years ago

In reply to: Amit Kapila (#1)

Re: parallelize queries containing subplans

On Wed, Dec 28, 2016 at 11:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Attached patch implements the above idea. This will enable
parallelism for queries containing un-correlated subplans, an example
of which is as follows:

I have reviewed the patch (pq_pushdown_subplan_v1.patch), Mostly patch
looks clean to me.
I have also done some basic functional testing and it's working fine.

I have one comment.

+ else if (IsA(node, AlternativeSubPlan))
+ {
+ AlternativeSubPlan *asplan = (AlternativeSubPlan *) node;
+ ListCell   *lc;
+
+ Assert(context->root);
+
+ foreach(lc, asplan->subplans)
+ {
+ SubPlan    *splan = (SubPlan *) lfirst(lc);
+ Plan   *plan;
+
+ Assert(IsA(splan, SubPlan));
+
+ plan = planner_subplan_get_plan(context->root, splan);
+ if (!plan->parallel_safe)
+ return true;
+ }
  }

I see no reason why we need to process the subplan list of
AlternativeSubPlan here, Anyway expression_tree_walker will take care
of processing each subplan of AlternativeSubPlan. Now in the case
where all the subplans in AlternativeSubPlan are parallel_safe we will
process this list twice. But, more than that it will be cleaner to not
handle AlternativeSubPlan here unless there is some strong reason?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Robert Haas

robertmhaas@gmail.com

over 9 years ago

In reply to: Amit Kapila (#1)

Re: parallelize queries containing subplans

On Wed, Dec 28, 2016 at 1:17 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Currently, queries that have references to SubPlans or
AlternativeSubPlans are considered parallel-restricted. I think we
can lift this restriction in many cases especially when SubPlans are
parallel-safe. To make this work, we need to propagate the
parallel-safety information from path node to plan node and the same
could be easily done while creating a plan. Another option could be
that instead of propagating parallel-safety information from path to
plan, we can find out from the plan if it is parallel-safe (doesn't
contain any parallel-aware node) by traversing whole plan tree, but I
think it is a waste of cycles. Once we have parallel-safety
information in the plan, we can use that for detection of
parallel-safe expressions in max_parallel_hazard_walker(). Finally,
we can pass all the subplans to workers during plan serialization in
ExecSerializePlan(). This will enable workers to execute subplans
that are referred in parallel part of the plan. Now, we might be able
to optimize it such that we pass only subplans that are referred in
parallel portion of plan, but I am not sure if it is worth the trouble
because it is one-time cost and much lesser than other things we do
(like creating
dsm, launching workers).

It seems unfortunate to have to add a parallel_safe flag to the
finished plan; the whole reason we have the Path-Plan distinction is
so that we can throw away information that won't be needed at
execution time. The parallel_safe flag is, in fact, not needed at
execution time, but just for further planning. Isn't there some way
that we can remember, at the time when a sublink is converted to a
subplan, whether or not the subplan was created from a parallel-safe
path? That seems like it would be cleaner than maintaining this flag
for all plans.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Amit Kapila

amit.kapila16@gmail.com

over 9 years ago

In reply to: Robert Haas (#5)

Re: parallelize queries containing subplans

On Tue, Jan 10, 2017 at 10:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Dec 28, 2016 at 1:17 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Currently, queries that have references to SubPlans or
AlternativeSubPlans are considered parallel-restricted. I think we
can lift this restriction in many cases especially when SubPlans are
parallel-safe. To make this work, we need to propagate the
parallel-safety information from path node to plan node and the same
could be easily done while creating a plan. Another option could be
that instead of propagating parallel-safety information from path to
plan, we can find out from the plan if it is parallel-safe (doesn't
contain any parallel-aware node) by traversing whole plan tree, but I
think it is a waste of cycles. Once we have parallel-safety
information in the plan, we can use that for detection of
parallel-safe expressions in max_parallel_hazard_walker(). Finally,
we can pass all the subplans to workers during plan serialization in
ExecSerializePlan(). This will enable workers to execute subplans
that are referred in parallel part of the plan. Now, we might be able
to optimize it such that we pass only subplans that are referred in
parallel portion of plan, but I am not sure if it is worth the trouble
because it is one-time cost and much lesser than other things we do
(like creating
dsm, launching workers).

It seems unfortunate to have to add a parallel_safe flag to the
finished plan; the whole reason we have the Path-Plan distinction is
so that we can throw away information that won't be needed at
execution time. The parallel_safe flag is, in fact, not needed at
execution time, but just for further planning. Isn't there some way
that we can remember, at the time when a sublink is converted to a
subplan, whether or not the subplan was created from a parallel-safe
path?

The other alternative is to remember this information in SubPlan. We
can retrieve parallel_safe information from best_path and can use it
while generating SubPlan. The main reason for storing it in the plan
was to avoid explicitly passing parallel_safe information while
generating SubPlan as plan was already available at that time.
However, it seems there are only two places in code (refer
build_subplan) where this information needs to be propagated. Let me
know if you prefer to remember the parallel_safe information in
SubPlan instead of in Plan or if you have something else in mind?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Robert Haas

robertmhaas@gmail.com

over 9 years ago

In reply to: Amit Kapila (#6)

Re: parallelize queries containing subplans

On Wed, Jan 11, 2017 at 9:58 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

The other alternative is to remember this information in SubPlan. We
can retrieve parallel_safe information from best_path and can use it
while generating SubPlan. The main reason for storing it in the plan
was to avoid explicitly passing parallel_safe information while
generating SubPlan as plan was already available at that time.
However, it seems there are only two places in code (refer
build_subplan) where this information needs to be propagated. Let me
know if you prefer to remember the parallel_safe information in
SubPlan instead of in Plan or if you have something else in mind?

I think we should try doing it in the SubPlan, at least, and see if
that comes out more elegant than what you have at the moment.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Amit Kapila

amit.kapila16@gmail.com

over 9 years ago

In reply to: Dilip Kumar (#4)

Re: parallelize queries containing subplans

On Tue, Jan 10, 2017 at 11:55 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Dec 28, 2016 at 11:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Attached patch implements the above idea. This will enable
parallelism for queries containing un-correlated subplans, an example
of which is as follows:

I have reviewed the patch (pq_pushdown_subplan_v1.patch), Mostly patch
looks clean to me.
I have also done some basic functional testing and it's working fine.

I have one comment.
+ else if (IsA(node, AlternativeSubPlan))
+ {
+ AlternativeSubPlan *asplan = (AlternativeSubPlan *) node;
+ ListCell   *lc;
+
+ Assert(context->root);
+
+ foreach(lc, asplan->subplans)
+ {
+ SubPlan    *splan = (SubPlan *) lfirst(lc);
+ Plan   *plan;
+
+ Assert(IsA(splan, SubPlan));
+
+ plan = planner_subplan_get_plan(context->root, splan);
+ if (!plan->parallel_safe)
+ return true;
+ }
}
I see no reason why we need to process the subplan list of
AlternativeSubPlan here, Anyway expression_tree_walker will take care
of processing each subplan of AlternativeSubPlan. Now in the case
where all the subplans in AlternativeSubPlan are parallel_safe we will
process this list twice.

Valid point, but I think we can avoid that by returning false after
foreach(..) loop. I think one improvement could be that instead of
manually checking the parallel safety of each subplan, we can
recursively call max_parallel_hazard_walker for each subplan.

But, more than that it will be cleaner to not
handle AlternativeSubPlan here unless there is some strong reason?

Yeah, the reason is to avoid unnecessary recursion. Also, in all
other places in the code, we do handle both of them separately, refer
cost_qual_eval_walker for somewhat similar usage.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Dilip Kumar

dilipbalaut@gmail.com

over 9 years ago

In reply to: Amit Kapila (#8)

Re: parallelize queries containing subplans

On Thu, Jan 12, 2017 at 9:22 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Valid point, but I think we can avoid that by returning false after
foreach(..) loop. I think one improvement could be that instead of
manually checking the parallel safety of each subplan, we can
recursively call max_parallel_hazard_walker for each subplan.

I agree that this way we can avoid the problem what I mentioned.

But, more than that it will be cleaner to not
handle AlternativeSubPlan here unless there is some strong reason?

Yeah, the reason is to avoid unnecessary recursion. Also, in all
other places in the code, we do handle both of them separately, refer
cost_qual_eval_walker for somewhat similar usage.

ok.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10

Amit Kapila

amit.kapila16@gmail.com

over 9 years ago

In reply to: Robert Haas (#7)

Re: parallelize queries containing subplans

On Thu, Jan 12, 2017 at 8:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jan 11, 2017 at 9:58 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

The other alternative is to remember this information in SubPlan. We
can retrieve parallel_safe information from best_path and can use it
while generating SubPlan. The main reason for storing it in the plan
was to avoid explicitly passing parallel_safe information while
generating SubPlan as plan was already available at that time.
However, it seems there are only two places in code (refer
build_subplan) where this information needs to be propagated. Let me
know if you prefer to remember the parallel_safe information in
SubPlan instead of in Plan or if you have something else in mind?

I think we should try doing it in the SubPlan, at least, and see if
that comes out more elegant than what you have at the moment.

Okay, done that way. I have fixed the review comments raised by Dilip
as well and added the test case in attached patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#11

Dilip Kumar

dilipbalaut@gmail.com

over 9 years ago

In reply to: Amit Kapila (#10)

Re: parallelize queries containing subplans

On Thu, Jan 12, 2017 at 7:56 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Okay, done that way. I have fixed the review comments raised by Dilip
as well and added the test case in attached patch.

I have tested with pq_pushdown_subplan_v2.patch +
pq_pushdown_correl_subplan_v1.patch

sqlsmith. is reporting below error, I have attached the query to
reproduce the issue. This issue is only coming when both the patch is
applied, only with pq_pushdown_subplan_v2.patch there is no problem.
So I think the problem is because of
pq_pushdown_correl_subplan_v1.patch.

ERROR: did not find '}' at end of input node at character 762

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#12

Tom Lane

tgl@sss.pgh.pa.us

over 9 years ago

In reply to: Dilip Kumar (#11)

Re: parallelize queries containing subplans

Dilip Kumar <dilipbalaut@gmail.com> writes:

ERROR: did not find '}' at end of input node at character 762

I've not looked at the patches, but just seeing this error message,
this looks like somebody's fat-fingered the correspondence between
outfuncs.c and readfuncs.c processing.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13

Amit Kapila

amit.kapila16@gmail.com

over 9 years ago

In reply to: Tom Lane (#12)

Re: parallelize queries containing subplans

On Fri, Jan 13, 2017 at 7:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Dilip Kumar <dilipbalaut@gmail.com> writes:

ERROR: did not find '}' at end of input node at character 762

I could reproduce this error with simple query like:
SELECT * FROM information_schema.role_usage_grants WHERE object_type
LIKE 'FOREIGN%' AND object_name IN ('s6', 'foo') ORDER BY 1, 2, 3, 4,
5;

The reason is that the stored rules have the old structure of Param. See below:
{RELABELTYPE :arg {PARAM :paramkind 2 :paramid 1 :paramtype 11975
:paramtypmod -1 :paramcollid 100 :location -1}

The new variable parallel_safe is not present in above node. If you
recreate the fresh database you won't see the above problem. I have
observed a problem in equal function which I have fixed in the
attached patch. I have also added a test, so that we can catch any
problem similar to what you have reported.

I've not looked at the patches, but just seeing this error message,
this looks like somebody's fat-fingered the correspondence between
outfuncs.c and readfuncs.c processing.

I think what we need is catversion bump as Param is part of stored rules.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#14

Amit Kapila

amit.kapila16@gmail.com

about 9 years ago

In reply to: Amit Kapila (#10)

Re: parallelize queries containing subplans

On Thu, Jan 12, 2017 at 7:56 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jan 12, 2017 at 8:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jan 11, 2017 at 9:58 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

The other alternative is to remember this information in SubPlan. We
can retrieve parallel_safe information from best_path and can use it
while generating SubPlan. The main reason for storing it in the plan
was to avoid explicitly passing parallel_safe information while
generating SubPlan as plan was already available at that time.
However, it seems there are only two places in code (refer
build_subplan) where this information needs to be propagated. Let me
know if you prefer to remember the parallel_safe information in
SubPlan instead of in Plan or if you have something else in mind?

I think we should try doing it in the SubPlan, at least, and see if
that comes out more elegant than what you have at the moment.

Okay, done that way. I have fixed the review comments raised by Dilip
as well and added the test case in attached patch.

After commit-ab1f0c8, this patch needs a rebase. Attached find
rebased version of the patch.

Thanks to Kuntal for informing me offlist that this patch needs rebase.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#15

Dilip Kumar

dilipbalaut@gmail.com

about 9 years ago

In reply to: Amit Kapila (#14)

Re: parallelize queries containing subplans

On Mon, Jan 16, 2017 at 9:13 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

After commit-ab1f0c8, this patch needs a rebase. Attached find
rebased version of the patch.

Thanks to Kuntal for informing me offlist that this patch needs rebase.

In this patch, I have observed some changes while creating subplan for
CTE. So I have reviewed this from that perspective and also tried to
perform some test.

Here is my finding/ comments.

@@ -1213,6 +1216,7 @@ SS_process_ctes(PlannerInfo *root)
    &splan->firstColCollation);
  splan->useHashTable = false;
  splan->unknownEqFalse = false;
+ splan->parallel_safe = best_path->parallel_safe;

I noticed that if path for CTE is parallel safe then we are marking
CTE subplan as parallel safe, In particular, I don't have any problem
with that, but can you add some test case which can cover this path, I
mean to say where CTE subplan are pushed.

------------
I have tried to test the subplan with CTE below is my test.
create table t1(a int , b varchar);
create table t (n int, b varchar);

Query:
explain verbose select * from t where t.n not in (WITH RECURSIVE t(n) AS (
VALUES (1)
UNION ALL
SELECT a+1 FROM t1 WHERE a < 100
)
SELECT sum(n) FROM t);

During debugging I found that subplan created for below part of the
query is parallel_unsafe, Is it a problem or there is some explanation
of why it's not parallel_safe,

(WITH RECURSIVE t(n) AS (
VALUES (1)
UNION ALL
SELECT a+1 FROM t1 WHERE a < 100
)
SELECT sum(n) FROM t);
----------

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16

Kuntal Ghosh

kuntalghosh.2007@gmail.com

about 9 years ago

In reply to: Dilip Kumar (#15)

Re: parallelize queries containing subplans

On Thu, Jan 19, 2017 at 3:05 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

@@ -1213,6 +1216,7 @@ SS_process_ctes(PlannerInfo *root)
&splan->firstColCollation);
splan->useHashTable = false;
splan->unknownEqFalse = false;
+ splan->parallel_safe = best_path->parallel_safe;
I noticed that if path for CTE is parallel safe then we are marking
CTE subplan as parallel safe, In particular, I don't have any problem
with that, but can you add some test case which can cover this path, I
mean to say where CTE subplan are pushed.

------------
I have tried to test the subplan with CTE below is my test.
create table t1(a int , b varchar);
create table t (n int, b varchar);

Query:
explain verbose select * from t where t.n not in (WITH RECURSIVE t(n) AS (
VALUES (1)
UNION ALL
SELECT a+1 FROM t1 WHERE a < 100
)
SELECT sum(n) FROM t);

During debugging I found that subplan created for below part of the
query is parallel_unsafe, Is it a problem or there is some explanation
of why it's not parallel_safe,

(WITH RECURSIVE t(n) AS (
VALUES (1)
UNION ALL
SELECT a+1 FROM t1 WHERE a < 100
)
SELECT sum(n) FROM t);
----------

The corresponding plan for the query you have specified is:

QUERY PLAN
----------------------------------------------------------------------------------
Seq Scan on public.t (cost=40.73..20894.74 rows=500480 width=35)
Output: t.n, t.b
Filter: (NOT (hashed SubPlan 2))
SubPlan 2
-> Aggregate (cost=40.72..40.73 rows=1 width=8)
Output: sum(t_1.n)
CTE t
-> Append (cost=0.00..31.18 rows=424 width=4)
-> Result (cost=0.00..0.01 rows=1 width=4)
Output: 1
-> Seq Scan on public.t1 (cost=0.00..26.93
rows=423 width=4)
Output: (t1.a + 1)
Filter: (t1.a < 100)
-> CTE Scan on t t_1 (cost=0.00..8.48 rows=424 width=4)
Output: t_1.n
(15 rows)

Now, the plan for CTE is parallel_safe. But, a CTE plan is converted
to an InitPlan and returns a Param which is used in the CTE Scan.
Since Param is not parallel_safe till now, the SubPlan is also not
parallel_safe. This is why CTE subplans will not be pushed under
Gather.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17

Kuntal Ghosh

kuntalghosh.2007@gmail.com

about 9 years ago

In reply to: Kuntal Ghosh (#16)

Re: parallelize queries containing subplans

On Thu, Jan 19, 2017 at 3:19 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

Since Param is not parallel_safe till now, the SubPlan is also not
parallel_safe. This is why CTE subplans will not be pushed under
Gather.

Specifically, Params which are generated using generate_new_param()
are not parallel_safe. In the patch, it is marked as parallel-unsafe.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18

Dilip Kumar

dilipbalaut@gmail.com

about 9 years ago

In reply to: Dilip Kumar (#15)

Re: parallelize queries containing subplans

On Thu, Jan 19, 2017 at 3:05 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

During debugging I found that subplan created for below part of the
query is parallel_unsafe, Is it a problem or there is some explanation
of why it's not parallel_safe,

Okay, so basically we don't have any mechanism to perform parallel
scan on CTE. And, IMHO subplan built for CTE (using SS_process_ctes)
must come along with CTE scan. So I think we can avoid setting below
code because we will never be able to test its side effect, another
argument can be that if we don't consider the final effect, and just
see this subplan then by logic it should be marked parallel-safe or
unsafe as per it's path and it will not have any side effect, as it
will finally become parallel-unsafe. So it's your call to keep it
either way.

@@ -1213,6 +1216,7 @@ SS_process_ctes(PlannerInfo *root)
    &splan->firstColCollation);
  splan->useHashTable = false;
  splan->unknownEqFalse = false;
+ splan->parallel_safe = best_path->parallel_safe;

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19

Kuntal Ghosh

kuntalghosh.2007@gmail.com

about 9 years ago

In reply to: Dilip Kumar (#18)

Re: parallelize queries containing subplans

On Thu, Jan 19, 2017 at 4:51 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Jan 19, 2017 at 3:05 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

During debugging I found that subplan created for below part of the
query is parallel_unsafe, Is it a problem or there is some explanation
of why it's not parallel_safe,

Okay, so basically we don't have any mechanism to perform parallel
scan on CTE. And, IMHO subplan built for CTE (using SS_process_ctes)
must come along with CTE scan. So I think we can avoid setting below
code because we will never be able to test its side effect, another
argument can be that if we don't consider the final effect, and just
see this subplan then by logic it should be marked parallel-safe or
unsafe as per it's path and it will not have any side effect, as it
will finally become parallel-unsafe. So it's your call to keep it
either way.

Oops, you're right. We don't consider parallelism for RTE_CTE type.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20

Amit Kapila

amit.kapila16@gmail.com

about 9 years ago

In reply to: Dilip Kumar (#18)

Re: parallelize queries containing subplans

On Thu, Jan 19, 2017 at 4:51 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Jan 19, 2017 at 3:05 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

During debugging I found that subplan created for below part of the
query is parallel_unsafe, Is it a problem or there is some explanation
of why it's not parallel_safe,

Okay, so basically we don't have any mechanism to perform parallel
scan on CTE. And, IMHO subplan built for CTE (using SS_process_ctes)
must come along with CTE scan. So I think we can avoid setting below
code because we will never be able to test its side effect, another
argument can be that if we don't consider the final effect, and just
see this subplan then by logic it should be marked parallel-safe or
unsafe as per it's path and it will not have any side effect, as it
will finally become parallel-unsafe. So it's your call to keep it
either way.

Yeah, actually setting parallel_safety information for subplan from
corresponding is okay. However, in this particular case as we know
that it might not be of any use till we enable parallelism for CTE
Scan (and doing that is certainly not essential for this project).
So, I have set parallel_safe as false for CTE subplans.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com