[sqlsmith] Failed assertion in parallel worker in ExecInitSubPlan

Started by Andreas Seltenreichabout 9 years ago10 messages
#1Andreas Seltenreich
seltenreich@gmx.de

Hi,

the query below triggers a parallel worker assertion for me when run on
the regression database of master as of 0832f2d. The plan sports a
couple of InitPlan nodes below Gather.

regards,
Andreas

Gather (cost=1.64..84.29 rows=128 width=4)
Workers Planned: 1
Single Copy: true
-> Limit (cost=1.64..84.29 rows=128 width=4)
-> Subquery Scan on subq_0 (cost=1.64..451.06 rows=696 width=4)
Filter: (subq_0.c6 IS NOT NULL)
-> Nested Loop Left Join (cost=1.64..444.07 rows=699 width=145)
Join Filter: (sample_0.aa = sample_1.pageno)
InitPlan 4 (returns $3)
-> Result (cost=1.21..5.36 rows=15 width=0)
One-Time Filter: ($0 AND ($1 = $2))
InitPlan 1 (returns $0)
-> Result (cost=0.00..0.00 rows=0 width=0)
One-Time Filter: false
InitPlan 2 (returns $1)
-> Limit (cost=0.35..0.52 rows=1 width=4)
-> Gather (cost=0.00..1.04 rows=6 width=4)
Workers Planned: 1
-> Parallel Seq Scan on reltime_tbl (cost=0.00..1.04 rows=4 width=4)
InitPlan 3 (returns $2)
-> Limit (cost=0.52..0.69 rows=1 width=4)
-> Gather (cost=0.00..1.04 rows=6 width=4)
Workers Planned: 1
-> Parallel Seq Scan on reltime_tbl reltime_tbl_1 (cost=0.00..1.04 rows=4 width=4)
-> Sample Scan on pg_foreign_data_wrapper sample_2 (cost=1.21..5.36 rows=15 width=0)
Sampling: system ('3.1'::real)
-> Nested Loop (cost=0.15..382.85 rows=699 width=4)
-> Sample Scan on pg_largeobject sample_1 (cost=0.00..209.03 rows=699 width=8)
Sampling: bernoulli ('2.9'::real)
Filter: (pageno IS NOT NULL)
-> Index Only Scan using pg_foreign_table_relid_index on pg_foreign_table ref_0 (cost=0.15..0.24 rows=1 width=4)
Index Cond: (ftrelid = sample_1.loid)
-> Materialize (cost=0.00..16.06 rows=4 width=4)
-> Append (cost=0.00..16.04 rows=4 width=4)
-> Sample Scan on a sample_0 (cost=0.00..4.01 rows=1 width=4)
Sampling: system ('5'::real)
-> Sample Scan on b sample_0_1 (cost=0.00..4.01 rows=1 width=4)
Sampling: system ('5'::real)
-> Sample Scan on c sample_0_2 (cost=0.00..4.01 rows=1 width=4)
Sampling: system ('5'::real)
-> Sample Scan on d sample_0_3 (cost=0.00..4.01 rows=1 width=4)
Sampling: system ('5'::real)

--8<---------------cut here---------------start------------->8---
set force_parallel_mode to on;
set max_parallel_workers_per_gather to 2;

select
91 as c0
from
(select
(select pfname from public.pslot limit 1 offset 3)
as c0,
ref_1.grandtot as c1,
(select pg_catalog.min(procost) from pg_catalog.pg_proc)
as c2,
ref_0.ftoptions as c3,
ref_1.grandtot as c4,
sample_1.loid as c5,
pg_catalog.pg_rotate_logfile() as c6,
(select random from public.hash_i4_heap limit 1 offset 5)
as c7,
sample_1.loid as c8
from
public.a as sample_0 tablesample system (5)
right join pg_catalog.pg_largeobject as sample_1 tablesample bernoulli (2.9)
inner join pg_catalog.pg_foreign_table as ref_0
on (sample_1.loid = ref_0.ftrelid )
on (sample_0.aa = sample_1.pageno )
left join public.mvtest_tvv as ref_1
on (EXISTS (
select
sample_2.fdwoptions as c0,
sample_2.fdwhandler as c1,
(select during from public.test_range_excl limit 1 offset 89)
as c2
from
pg_catalog.pg_foreign_data_wrapper as sample_2 tablesample system (3.1)
where (EXISTS (
select
sample_3.b as c0,
(select grantee from information_schema.udt_privileges limit 1 offset 4)
as c1,
sample_3.b as c2,
sample_3.rf_a as c3,
sample_3.b as c4,
sample_3.rf_a as c5,
sample_3.rf_a as c6
from
public.clstr_tst_s as sample_3 tablesample system (8.1)
where sample_3.rf_a >= cast(null as int8)
limit 141))
and ((select f1 from public.reltime_tbl limit 1 offset 2)
= (select f1 from public.reltime_tbl limit 1 offset 3)
)
limit 49))
where sample_1.pageno is not NULL) as subq_0
where subq_0.c6 is not NULL
limit 128;
--8<---------------cut here---------------end--------------->8---

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Amit Kapila
amit.kapila16@gmail.com
In reply to: Andreas Seltenreich (#1)
Re: [sqlsmith] Failed assertion in parallel worker in ExecInitSubPlan

On Sun, Nov 20, 2016 at 10:43 PM, Andreas Seltenreich
<seltenreich@gmx.de> wrote:

Hi,

the query below triggers a parallel worker assertion for me when run on
the regression database of master as of 0832f2d. The plan sports a
couple of InitPlan nodes below Gather.

I think the reason of this issue is that in some cases where subplan
is at some node other than top_plan node, we allow them to be executed
in the worker in force_parallel_mode. It seems to me that the
problematic code is below check in standard_planner()

if (force_parallel_mode != FORCE_PARALLEL_OFF &&
best_path->parallel_safe &&
top_plan->initPlan == NIL)

Here instead of checking whether top_plan has initPlan, it should
check whether initPlan is present anywhere in plan tree. I think one
simple way could be to check *glob->subplans* instead of
top_plan->initPlan, another possibility is to traverse the whole tree
to see if initPlan is present.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#2)
1 attachment(s)
Re: [sqlsmith] Failed assertion in parallel worker in ExecInitSubPlan

On Mon, Nov 21, 2016 at 6:10 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Nov 20, 2016 at 10:43 PM, Andreas Seltenreich
<seltenreich@gmx.de> wrote:

Hi,

the query below triggers a parallel worker assertion for me when run on
the regression database of master as of 0832f2d. The plan sports a
couple of InitPlan nodes below Gather.

I think the reason of this issue is that in some cases where subplan
is at some node other than top_plan node, we allow them to be executed
in the worker in force_parallel_mode. It seems to me that the
problematic code is below check in standard_planner()

if (force_parallel_mode != FORCE_PARALLEL_OFF &&
best_path->parallel_safe &&
top_plan->initPlan == NIL)

Here instead of checking whether top_plan has initPlan, it should
check whether initPlan is present anywhere in plan tree. I think one
simple way could be to check *glob->subplans* instead of
top_plan->initPlan,

Patch based on above suggestion is attached with this mail.

Thoughts?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

restrict_subplans_parallel_mode_v1.patchapplication/octet-stream; name=restrict_subplans_parallel_mode_v1.patchDownload
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index d8c5dd3..16f807f 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -342,11 +342,11 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	 * actually a safe thing to do.  (Note: we assume adding a Material node
 	 * above did not change the parallel safety of the plan, so we can still
 	 * rely on best_path->parallel_safe.  However, that flag doesn't account
-	 * for initPlans, which render the plan parallel-unsafe.)
+	 * for subplans, which render the plan parallel-unsafe.)
 	 */
 	if (force_parallel_mode != FORCE_PARALLEL_OFF &&
 		best_path->parallel_safe &&
-		top_plan->initPlan == NIL)
+		glob->subplans == NIL)
 	{
 		Gather	   *gather = makeNode(Gather);
 
#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#3)
Re: [sqlsmith] Failed assertion in parallel worker in ExecInitSubPlan

Amit Kapila <amit.kapila16@gmail.com> writes:

On Mon, Nov 21, 2016 at 6:10 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Here instead of checking whether top_plan has initPlan, it should
check whether initPlan is present anywhere in plan tree. I think one
simple way could be to check *glob->subplans* instead of
top_plan->initPlan,

Patch based on above suggestion is attached with this mail.

I think this is the right fix for the moment, because the proximate cause
of the crash is that ExecSerializePlan doesn't transmit any part of the
PlannedStmt.subplans list, which glob->subplans is the preimage of.

Now, maybe I'm missing something, but it seems to me that ordinary
subplans could be transmitted to workers just fine. The problem with
transmitting initplans is that you'd lose the expectation of single
evaluation. (Even there, it might be okay if they're far enough down
in the plan tree, but I haven't thought about it in detail.) So I'd
rather see us fix ExecSerializePlan to transmit the subplans list
and have some more-restrictive test here. This code would still be
wrong as it stands though.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#4)
Re: [sqlsmith] Failed assertion in parallel worker in ExecInitSubPlan

On Mon, Nov 21, 2016 at 9:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

On Mon, Nov 21, 2016 at 6:10 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Here instead of checking whether top_plan has initPlan, it should
check whether initPlan is present anywhere in plan tree. I think one
simple way could be to check *glob->subplans* instead of
top_plan->initPlan,

Patch based on above suggestion is attached with this mail.

I think this is the right fix for the moment, because the proximate cause
of the crash is that ExecSerializePlan doesn't transmit any part of the
PlannedStmt.subplans list, which glob->subplans is the preimage of.

Now, maybe I'm missing something, but it seems to me that ordinary
subplans could be transmitted to workers just fine. The problem with
transmitting initplans is that you'd lose the expectation of single
evaluation.

Yes and I think we can handle it such that master backend evaluates
initplans and share the calculated value along with paramid with all
the workers. Workers will, in turn, restore it in
queryDesc->estate->es_param_exec_vals (or some other place where those
can be directly used, I have yet to evaluate on this matter). I am
working on a patch to parallelize queries containing
initplans/subplans, so I will evaluate your suggestion of passing
subplans (maybe non-InitPlans) in ExecSerializePlan as part of that
patch. I have yet to figure out what is the best way to share hashed
subplans, do we pass them as it is and let each worker evaluate and
store it's own copy of hash table or shall we try to form the hash
table once in master and then share the same with workers (which could
be tricky) or shall we restrict such queries which contain hashed
subplans based on assumption that it will be costly for each worker to
form the hash table.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Andreas Seltenreich
seltenreich@gmx.de
In reply to: Tom Lane (#4)
Re: [sqlsmith] Failed assertion in parallel worker in ExecInitSubPlan

Hi,

just caught another InitPlan below Gather with the recent patches in
(master as of 4cc6a3f). Recipe below.

regards,
andreas

set max_parallel_workers_per_gather = 2;
set min_parallel_relation_size = 0;
set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;

explain select 1 from
public.quad_point_tbl as ref_0,
lateral (select
ref_0.p as c3,
sample_0.d as c5
from
public.nv_child_2010 as sample_0
left join public.mvtest_tvv as ref_1
on ('x'< (select contype from pg_catalog.pg_constraint limit 1))
limit 82) as subq_0;

-- QUERY PLAN
-- --------------------------------------------------------------------------------------------------------
-- Gather (cost=0.19..13727.52 rows=902246 width=4)
-- Workers Planned: 2
-- -> Nested Loop (cost=0.19..13727.52 rows=902246 width=4)
-- -> Parallel Seq Scan on quad_point_tbl ref_0 (cost=0.00..105.85 rows=4585 width=16)
-- -> Limit (cost=0.19..1.33 rows=82 width=20)
-- InitPlan 1 (returns $0)
-- -> Limit (cost=0.00..0.19 rows=1 width=1)
-- -> Gather (cost=0.00..10.22 rows=54 width=1)
-- Workers Planned: 2
-- -> Parallel Seq Scan on pg_constraint (cost=0.00..10.22 rows=22 width=1)
-- -> Seq Scan on nv_child_2010 sample_0 (cost=0.00..35.50 rows=2550 width=20)

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Amit Kapila
amit.kapila16@gmail.com
In reply to: Andreas Seltenreich (#6)
1 attachment(s)
Re: [sqlsmith] Failed assertion in parallel worker in ExecInitSubPlan

On Fri, Nov 25, 2016 at 3:26 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote:

Hi,

just caught another InitPlan below Gather with the recent patches in
(master as of 4cc6a3f). Recipe below.

I think this problem exists since commit
110a6dbdebebac9401b43a8fc223e6ec43cd4d10 where we have allowed
subqueries to be pushed to parallel workers. I think we should
consider rel (where rtekind is RTE_SUBQUERY) to be parallel safe if
the subquery is also parallel safe. Attached patch does that and
fixes the reported problem for me.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

allow_safe_subquery_parallel_worker_v1.patchapplication/octet-stream; name=allow_safe_subquery_parallel_worker_v1.patchDownload
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index e42ef98..caa691c 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -574,17 +574,17 @@ set_rel_consider_parallel(PlannerInfo *root, RelOptInfo *rel,
 
 			/*
 			 * There's no intrinsic problem with scanning a subquery-in-FROM
-			 * (as distinct from a SubPlan or InitPlan) in a parallel worker.
-			 * If the subquery doesn't happen to have any parallel-safe paths,
-			 * then flagging it as consider_parallel won't change anything,
-			 * but that's true for plain tables, too.  We must set
-			 * consider_parallel based on the rel's own quals and targetlist,
-			 * so that if a subquery path is parallel-safe but the quals and
-			 * projection we're sticking onto it are not, we correctly mark
-			 * the SubqueryScanPath as not parallel-safe.  (Note that
+			 * (as distinct from a SubPlan or InitPlan) in a parallel worker
+			 * if it is parallel-safe.  We must set consider_parallel based
+			 * on the rel's own quals and targetlist, so that if a subquery
+			 * path is parallel-safe but the quals and projection we're
+			 * sticking onto it are not, we correctly mark the
+			 * SubqueryScanPath as not parallel-safe.  (Note that
 			 * set_subquery_pathlist() might push some of these quals down
 			 * into the subquery itself, but that doesn't change anything.)
 			 */
+			if (!is_parallel_safe(root, (Node *) rte->subquery))
+				return;
 			break;
 
 		case RTE_JOIN:
#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#7)
Re: [sqlsmith] Failed assertion in parallel worker in ExecInitSubPlan

Amit Kapila <amit.kapila16@gmail.com> writes:

On Fri, Nov 25, 2016 at 3:26 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote:

just caught another InitPlan below Gather with the recent patches in
(master as of 4cc6a3f). Recipe below.

I think this problem exists since commit
110a6dbdebebac9401b43a8fc223e6ec43cd4d10 where we have allowed
subqueries to be pushed to parallel workers.

The impression I got in poking at this for a few minutes, before
going off to stuff myself with turkey, was that we were allowing
a SubqueryScanPath to mark itself as parallel-safe even though the
resulting plan node would have an InitPlan attached. So my thought
about fixing it was along the lines of if-subroot-contains-initplans-
then-dont-let-SubqueryScanPath-be-parallel-safe. What you propose
here seems like it would shut off parallel query in more cases than
that. But I'm still full of turkey and may be missing something.

There's another issue here, which is that the InitPlan is derived from an
outer join qual whose outer join seems to have been deleted entirely by
analyzejoins.c. Up to now it was possible to avert our eyes from the fact
that join removal is lazy about getting rid of unused InitPlans, but if
they're going to disable parallelization of the surrounding query, maybe
we need to work harder on that.

Another question worth asking is whether it was okay for the subquery to
decide to use a Gather. Are we OK with multiple Gathers per plan tree,
independently of the points above?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#8)
1 attachment(s)
Re: [sqlsmith] Failed assertion in parallel worker in ExecInitSubPlan

On Fri, Nov 25, 2016 at 10:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

On Fri, Nov 25, 2016 at 3:26 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote:

just caught another InitPlan below Gather with the recent patches in
(master as of 4cc6a3f). Recipe below.

I think this problem exists since commit
110a6dbdebebac9401b43a8fc223e6ec43cd4d10 where we have allowed
subqueries to be pushed to parallel workers.

The impression I got in poking at this for a few minutes, before
going off to stuff myself with turkey, was that we were allowing
a SubqueryScanPath to mark itself as parallel-safe even though the
resulting plan node would have an InitPlan attached. So my thought
about fixing it was along the lines of if-subroot-contains-initplans-
then-dont-let-SubqueryScanPath-be-parallel-safe.

I think this will work for the reported case, see the patch attached.
However, don't we need to worry about if there is a subplan
(non-initplan) instead of initplan. I have tried by tweaking the
above query such that it will generate a subplan and for such a case
it will make SubqueryScanPath as parallel-safe.

explain select 1 from public.quad_point_tbl as ref_0, lateral (select
ref_0.p as c3, sample_0.d as c5 from public.nv_child_2010 as sample_0
left join public.mvtest_tvv as ref_1 on ('x' = ANY (select contype
from pg_catalog.pg_constraint limit 1)) limit 82) as subq_0;

What you propose
here seems like it would shut off parallel query in more cases than
that. But I'm still full of turkey and may be missing something.

There's another issue here, which is that the InitPlan is derived from an
outer join qual whose outer join seems to have been deleted entirely by
analyzejoins.c. Up to now it was possible to avert our eyes from the fact
that join removal is lazy about getting rid of unused InitPlans, but if
they're going to disable parallelization of the surrounding query, maybe
we need to work harder on that.

Yeah, that makes sense, but not sure whether we should try it along
with this patch.

Another question worth asking is whether it was okay for the subquery to
decide to use a Gather. Are we OK with multiple Gathers per plan tree,
independently of the points above?

As of now, we don't support nested Gather case. Example:

Not Okay Plan
---------------------
-> Gather
-> Nested Loop
-> Parallel Seq Scan
-> Gather
-> Parallel Seq Scan

Okay Plan
---------------------
-> Nested Loop
-> Gather
-> Parallel Seq Scan
-> Gather
-> Parallel Seq Scan

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

allow_safe_subquery_parallel_worker_v2.patchapplication/octet-stream; name=allow_safe_subquery_parallel_worker_v2.patchDownload
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 6d3ccfd..6bda46f 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1717,7 +1717,7 @@ create_subqueryscan_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 														  required_outer);
 	pathnode->path.parallel_aware = false;
 	pathnode->path.parallel_safe = rel->consider_parallel &&
-		subpath->parallel_safe;
+		subpath->parallel_safe && !rel->subroot->init_plans;
 	pathnode->path.parallel_workers = subpath->parallel_workers;
 	pathnode->path.pathkeys = pathkeys;
 	pathnode->subpath = subpath;
#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#9)
Re: [sqlsmith] Failed assertion in parallel worker in ExecInitSubPlan

Amit Kapila <amit.kapila16@gmail.com> writes:

On Fri, Nov 25, 2016 at 10:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The impression I got in poking at this for a few minutes, before
going off to stuff myself with turkey, was that we were allowing
a SubqueryScanPath to mark itself as parallel-safe even though the
resulting plan node would have an InitPlan attached. So my thought
about fixing it was along the lines of if-subroot-contains-initplans-
then-dont-let-SubqueryScanPath-be-parallel-safe.

I think this will work for the reported case, see the patch attached.

After further thought, it seems to me that the fundamental error here
is that we're not accounting for initPlans while marking paths as
parallel safe or not. They should be correctly marked when they come
out of subquery_planner, rather than expecting callers to know that
they can't trust that marking. That way, we don't need to do anything
special in create_subqueryscan_path, and we can also get rid of that
kluge in the force_parallel_mode logic.

However, don't we need to worry about if there is a subplan
(non-initplan) instead of initplan.

I don't think so. References to subplans are already known to be parallel
restricted. The issue that we're fixing here is that if a plan node has
initPlans attached, ExecutorStart will try to access those subplans,
whether or not they actually get used while running. That's why the plan
node itself has to be marked parallel-unsafe so it won't get shipped to
parallel workers.

There's another issue here, which is that the InitPlan is derived from an
outer join qual whose outer join seems to have been deleted entirely by
analyzejoins.c. Up to now it was possible to avert our eyes from the fact
that join removal is lazy about getting rid of unused InitPlans, but if
they're going to disable parallelization of the surrounding query, maybe
we need to work harder on that.

Yeah, that makes sense, but not sure whether we should try it along
with this patch.

I'm not certain that sublinks in deletable outer join quals is a case
important enough to sweat over. But this is the first time I've realized
that failure to remove those initPlans is capable of making a plan worse.
So it's something to keep an eye on. (I still wish that we could make
join removal happen during join tree preprocessing; doing it where it is
is nothing but a kluge, with unpleasant side effects like this one.)

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers