Parallelize correlated subqueries that execute within each worker

Started by James Colemanabout 5 years ago46 messageshackers

jtc331@gmail.com

about 5 years ago

In a bug report back in November [1] a subthread explored why parallel
query is excluded any time we have "Plan nodes which reference a
correlated SubPlan". Amit's understanding was that the reasoning had
to do with inability to easily pass (potentially variable length)
Param values between workers.

However a decent-sized subset of this kind of query doesn't actually
require that we communicate between workers. If the Subplan executes
per-tuple within the worker then there's no reason I can see why it
needs to be marked parallel unsafe. Amit concurred but noted that
identifying that subset of plans is the difficult part (as is usually
the case!)

At the time I'd started work on an approach to handle this case and
hoped to "post about it in a new thread later this week." That didn't
happen, but here we are now, and I finally have this patch cleaned up
enough to share.

The basic idea is that we need to track (both on nodes and relations)
not only whether that node or rel is parallel safe but also whether
it's parallel safe assuming params are rechecked in the using context.
That allows us to delay making a final decision until we have
sufficient context to conclude that a given usage of a Param is
actually parallel safe or unsafe.

The first patch in this series was previously posted in the thread
"Consider parallel for lateral subqueries with limit" [2] and is
required as a precursor for various test cases to work here.

The second patch implements the core of the series. It results in
parallel query being possible for subplans that execute entirely
within the context of a parallel worker for cases where that subplan
is in the target, a LATERAL JOIN, or the WHERE and ORDER BY clauses.

The final patch notes several places where we set e.g.
rel->consider_parallel but setting the corresponding new value
rel->consider_parallel_recheckng_params wasn't yet necessary. It shows
opportunity either for further improvement or concluding certain cases
can't benefit and should be left unchanged.

James

1: /messages/by-id/CAAaqYe_vihKjc+8LuQa49EHW4+Kfefb3wHqPYFnCuUqozo+LFg@mail.gmail.com
2: /messages/by-id/CAAaqYe_HEkmLwf_1iEHxXwQOWiRyiFd=uOu6kwj3sWYdVd1-zA@mail.gmail.com

Daniel Gustafsson

daniel@yesql.se

over 4 years ago

In reply to: James Coleman (#1)

Re: Parallelize correlated subqueries that execute within each worker

On 7 May 2021, at 18:30, James Coleman <jtc331@gmail.com> wrote:

..here we are now, and I finally have this patch cleaned up
enough to share.

This patch no longer applies to HEAD, can you please submit a rebased version?

--
Daniel Gustafsson https://vmware.com/

James Coleman

jtc331@gmail.com

over 4 years ago

In reply to: Daniel Gustafsson (#2)

Re: Parallelize correlated subqueries that execute within each worker

On Wed, Sep 1, 2021 at 7:06 AM Daniel Gustafsson <daniel@yesql.se> wrote:

On 7 May 2021, at 18:30, James Coleman <jtc331@gmail.com> wrote:

..here we are now, and I finally have this patch cleaned up
enough to share.

This patch no longer applies to HEAD, can you please submit a rebased version?

See attached.

Thanks,
James

Zhihong Yu

zyu@yugabyte.com

over 4 years ago

In reply to: James Coleman (#3)

Re: Parallelize correlated subqueries that execute within each worker

On Tue, Sep 7, 2021 at 6:17 AM James Coleman <jtc331@gmail.com> wrote:

On Wed, Sep 1, 2021 at 7:06 AM Daniel Gustafsson <daniel@yesql.se> wrote:

On 7 May 2021, at 18:30, James Coleman <jtc331@gmail.com> wrote:

..here we are now, and I finally have this patch cleaned up
enough to share.

This patch no longer applies to HEAD, can you please submit a rebased

version?

See attached.

Thanks,
James

Hi,
For v2-0002-Parallel-query-support-for-basic-correlated-subqu.patch :

+ * is when we're going to execute multiple partial parths in parallel

parths -> paths

        if (index->amcanparallel &&
-           rel->consider_parallel && outer_relids == NULL &&
-           scantype != ST_BITMAPSCAN)
+               rel->consider_parallel && outer_relids == NULL &&
+               scantype != ST_BITMAPSCAN)

the change above seems unnecessary since the first line of if condition
doesn't change.
Similar comment for the next hunk.

+ * It's not a partial path; it'a a full path that is executed
as a subquery.

it'a a -> it's a

+           /* rel->consider_parallel_rechecking_params = false; */
+           /* rel->partial_pathlist = NIL; */

The commented code can be taken out.

Cheers

James Coleman

jtc331@gmail.com

over 4 years ago

In reply to: Zhihong Yu (#4)

Re: Parallelize correlated subqueries that execute within each worker

On Tue, Sep 7, 2021 at 11:06 AM Zhihong Yu <zyu@yugabyte.com> wrote:

On Tue, Sep 7, 2021 at 6:17 AM James Coleman <jtc331@gmail.com> wrote:

On Wed, Sep 1, 2021 at 7:06 AM Daniel Gustafsson <daniel@yesql.se> wrote:

On 7 May 2021, at 18:30, James Coleman <jtc331@gmail.com> wrote:

..here we are now, and I finally have this patch cleaned up
enough to share.

This patch no longer applies to HEAD, can you please submit a rebased version?

See attached.

Thanks,
James

Hi,
For v2-0002-Parallel-query-support-for-basic-correlated-subqu.patch :

+ * is when we're going to execute multiple partial parths in parallel

parths -> paths
if (index->amcanparallel &&
-           rel->consider_parallel && outer_relids == NULL &&
-           scantype != ST_BITMAPSCAN)
+               rel->consider_parallel && outer_relids == NULL &&
+               scantype != ST_BITMAPSCAN)
the change above seems unnecessary since the first line of if condition doesn't change.
Similar comment for the next hunk.

+ * It's not a partial path; it'a a full path that is executed as a subquery.

it'a a -> it's a
+           /* rel->consider_parallel_rechecking_params = false; */
+           /* rel->partial_pathlist = NIL; */
The commented code can be taken out.

Thanks for taking a look at this.

See updated patch series attached.

James Coleman

jtc331@gmail.com

over 4 years ago

In reply to: James Coleman (#5)

Re: Parallelize correlated subqueries that execute within each worker

On Wed, Sep 8, 2021 at 8:47 AM James Coleman <jtc331@gmail.com> wrote:

See updated patch series attached.

Jaime,

I noticed on 3-October you moved this into "waiting on author"; I
don't see anything waiting in this thread, however. Am I missing
something?

I'm planning to change it back to "needs review".

Thanks,
James

Robert Haas

robertmhaas@gmail.com

over 4 years ago

In reply to: James Coleman (#1)

Re: Parallelize correlated subqueries that execute within each worker

As a preliminary comment, it would be quite useful to get Tom Lane's
opinion on this, since it's not an area I understand especially well,
and I think he understands it better than anyone.

On Fri, May 7, 2021 at 12:30 PM James Coleman <jtc331@gmail.com> wrote:

The basic idea is that we need to track (both on nodes and relations)
not only whether that node or rel is parallel safe but also whether
it's parallel safe assuming params are rechecked in the using context.
That allows us to delay making a final decision until we have
sufficient context to conclude that a given usage of a Param is
actually parallel safe or unsafe.

I don't really understand what you mean by "assuming params are
rechecked in the using context." However, I think that a possibly
better approach to this whole area would be to try to solve the
problem by putting limits on where you can insert a Gather node.
Consider:

Nested Loop
-> Seq Scan on x
-> Index Scan on y
Index Cond: y.q = x.q

If you insert a Gather node atop the Index Scan, necessarily changing
it to a Parallel Index Scan, then you need to pass values around. For
every value we get for x.q, we would need to start workers, sending
them the value of x.q, and they do a parallel index scan working
together to find all rows where y.q = x.q, and then exit. We repeat
this for every tuple from x.q. In the absence of infrastructure to
pass those parameters, we can't put the Gather there. We also don't
want to, because it would be really slow.

If you insert the Gather node atop the Seq Scan or the Nested Loop, in
either case necessarily changing the Seq Scan to a Parallel Seq Scan,
you have no problem. If you put it on top of the Nested Loop, the
parameter will be set in the workers and used in the workers and
everything is fine. If you put it on top of the Seq Scan, the
parameter will be set in the leader -- by the Nested Loop -- and used
in the leader, and again you have no problem.

So in my view of the world, the parameter just acts as an additional
constraint on where Gather nodes can be placed. I don't see that there
are any parameters that are unsafe categorically -- they're just
unsafe if the place where they are set is on a different side of the
Gather from the place where they are used. So I don't understand --
possibly just because I'm dumb -- the idea behind
consider_parallel_rechecking_params, because that seems to be making a
sort of overall judgement about the safety or unsafety of the
parameter on its own merits, rather than thinking about the Gather
placement.

When I last worked on this, I had hoped that extParam or allParam
would be the thing that would answer the question: are there any
parameters used under this node that are not also set under this node?
But I seem to recall that neither seemed to be answering precisely
that question, and the lousy naming of those fields and limited
documentation of their intended purpose did not help.

--
Robert Haas
EDB: http://www.enterprisedb.com

Tom Lane

tgl@sss.pgh.pa.us

over 4 years ago

In reply to: Robert Haas (#7)

Re: Parallelize correlated subqueries that execute within each worker

Robert Haas <robertmhaas@gmail.com> writes:

When I last worked on this, I had hoped that extParam or allParam
would be the thing that would answer the question: are there any
parameters used under this node that are not also set under this node?
But I seem to recall that neither seemed to be answering precisely
that question, and the lousy naming of those fields and limited
documentation of their intended purpose did not help.

FWIW, I've never been very happy with those fields either. IIRC the
design in that area was all Vadim's, but to the extent that there's
any usable documentation of extParam/allParam, it was filled in by me
while trying to understand what Vadim did. If somebody wants to step
up and do a rewrite to make the planner's Param management more useful
or at least easier to understand, I think that'd be great.

But anyway: yeah, those fields as currently constituted don't help
much. They tell you which Params are consumed by this node or its
subnodes, but not where those Params came from. The planner's
plan_params and outer_params fields might be more nearly the right
thing, but I'm not sure they're spot-on either, nor that they're
up-to-date at the point where you'd want to make decisions about
Gather safety.

regards, tom lane

Robert Haas

robertmhaas@gmail.com

over 4 years ago

In reply to: Tom Lane (#8)

Re: Parallelize correlated subqueries that execute within each worker

On Wed, Nov 3, 2021 at 11:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

FWIW, I've never been very happy with those fields either. IIRC the
design in that area was all Vadim's, but to the extent that there's
any usable documentation of extParam/allParam, it was filled in by me
while trying to understand what Vadim did. If somebody wants to step
up and do a rewrite to make the planner's Param management more useful
or at least easier to understand, I think that'd be great.

Good to know, thanks.

But anyway: yeah, those fields as currently constituted don't help
much. They tell you which Params are consumed by this node or its
subnodes, but not where those Params came from. The planner's
plan_params and outer_params fields might be more nearly the right
thing, but I'm not sure they're spot-on either, nor that they're
up-to-date at the point where you'd want to make decisions about
Gather safety.

One thing I discovered when I was looking at this a few years ago is
that there was only one query in the regression tests where extParam
and allParam were not the same. The offending query was select 1 =
all(select (select 1)), and the resulting plan has a Materialize node
with an attached InitPlan. For that Materialize node, extParam = {}
and allParam = {$0}, with $0 also being the output parameter of the
InitPlan attached that that Materialize node. In every other node in
that plan and in every node of every other plan generated by the
regression tests, the values were identical. So it's extremely niche
that these fields are even different from each other, and it's unclear
to me that we really need both of them.

What's also interesting is that extParam is computed (by
finalize_plan) as plan->extParam = bms_del_members(plan->extParam,
initSetParam). So I think it mostly ends up that extParam for a node
is not exactly all the parameters that anything under that node cares
about, but rather - approximately - all the things that anything under
that node cares about that aren't also set someplace under that node.
If it were exactly that, I think it would be perfect for our needs
here: if the set of things used but not set below the current level is
empty, it's OK to insert a Gather node; otherwise, it's not, at least,
not unless we find a way to pipe parameters from the leader into the
workers. But I think there's some reason that I no longer remember why
it's not exactly that, and therefore the idea doesn't work.

One problem I do remember is that attaching initplans at the top of
each subquery level as we presently do is really not good for this
kind of thing. Suppose you have several levels of Nested Loop and
someplace down in the plan you reference an InitPlan. The planner sees
no harm in attaching the InitPlan at the top level, which makes it
unsafe to put the Gather any place but at the top level. If you
attached the InitPlan to the lowest node in the plan tree that is high
enough to be above all the places that use the value from that
parameter, you could potentially shift the Gather down the plan tree,
which would be great if, for example, there's exactly one
parallel-restricted join and the rest are parallel-safe. The best plan
might be to do all the other joins under a Gather and then perform the
parallel-restricted join above it.

But I found it very hard to figure out how to rejigger the logic that
places InitPlans to be more intelligent, and eventually gave up.

--
Robert Haas
EDB: http://www.enterprisedb.com

#10

Tom Lane

tgl@sss.pgh.pa.us

over 4 years ago

In reply to: Robert Haas (#9)

Re: Parallelize correlated subqueries that execute within each worker

Robert Haas <robertmhaas@gmail.com> writes:

One thing I discovered when I was looking at this a few years ago is
that there was only one query in the regression tests where extParam
and allParam were not the same. The offending query was select 1 =
all(select (select 1)), and the resulting plan has a Materialize node
with an attached InitPlan. For that Materialize node, extParam = {}
and allParam = {$0}, with $0 also being the output parameter of the
InitPlan attached that that Materialize node. In every other node in
that plan and in every node of every other plan generated by the
regression tests, the values were identical. So it's extremely niche
that these fields are even different from each other, and it's unclear
to me that we really need both of them.

Yeah, I've had that nagging feeling about them too. But ISTR trying to
reduce them to one value years ago, and finding that it didn't quite work,
or at least would result in more subquery-re-evaluation than we do today.
You have to dig into what the executor uses these values for to really
grok them. I'm afraid that that detail is all swapped out right now, so
I can't say much more.

regards, tom lane

#11

James Coleman

jtc331@gmail.com

over 4 years ago

In reply to: Robert Haas (#7)

Re: Parallelize correlated subqueries that execute within each worker

Hi Robert, thanks for the detailed reply.

On Wed, Nov 3, 2021 at 10:48 AM Robert Haas <robertmhaas@gmail.com> wrote:

As a preliminary comment, it would be quite useful to get Tom Lane's
opinion on this, since it's not an area I understand especially well,
and I think he understands it better than anyone.

On Fri, May 7, 2021 at 12:30 PM James Coleman <jtc331@gmail.com> wrote:

The basic idea is that we need to track (both on nodes and relations)
not only whether that node or rel is parallel safe but also whether
it's parallel safe assuming params are rechecked in the using context.
That allows us to delay making a final decision until we have
sufficient context to conclude that a given usage of a Param is
actually parallel safe or unsafe.

I don't really understand what you mean by "assuming params are
rechecked in the using context." However, I think that a possibly
better approach to this whole area would be to try to solve the
problem by putting limits on where you can insert a Gather node.
Consider:

Nested Loop
-> Seq Scan on x
-> Index Scan on y
Index Cond: y.q = x.q

If you insert a Gather node atop the Index Scan, necessarily changing
it to a Parallel Index Scan, then you need to pass values around. For
every value we get for x.q, we would need to start workers, sending
them the value of x.q, and they do a parallel index scan working
together to find all rows where y.q = x.q, and then exit. We repeat
this for every tuple from x.q. In the absence of infrastructure to
pass those parameters, we can't put the Gather there. We also don't
want to, because it would be really slow.

If you insert the Gather node atop the Seq Scan or the Nested Loop, in
either case necessarily changing the Seq Scan to a Parallel Seq Scan,
you have no problem. If you put it on top of the Nested Loop, the
parameter will be set in the workers and used in the workers and
everything is fine. If you put it on top of the Seq Scan, the
parameter will be set in the leader -- by the Nested Loop -- and used
in the leader, and again you have no problem.

So in my view of the world, the parameter just acts as an additional
constraint on where Gather nodes can be placed. I don't see that there
are any parameters that are unsafe categorically -- they're just
unsafe if the place where they are set is on a different side of the
Gather from the place where they are used. So I don't understand --
possibly just because I'm dumb -- the idea behind
consider_parallel_rechecking_params, because that seems to be making a
sort of overall judgement about the safety or unsafety of the
parameter on its own merits, rather than thinking about the Gather
placement.

I had to read through this several times before I understood the point
(not your fault, this is, as you note, a complicated area). I *think*
if I grok it properly you're effectively describing what this patch
results in conceptually (but possibly solving it from a different
direction).

As I understand the current code, parallel plans are largely chosen
based not on where it's safe to insert a Gather node but rather by
determining if a given path is parallel safe. Through that lens params
are a bit of an odd man out -- they aren't inherently unsafe in the
way a parallel-unsafe function is, but they can only be used in
parallel plans under certain conditions (whether because of project
policy, performance, or missing infrastructure).

Under that paradigm the existing consider_parallel and parallel_safe
boolean values imply everything is about whether a plan is inherently
parallel safe. Thus the current doesn't have the context to handle the
nuance of params (as they are not inherently parallel-safe or unsafe).

Introducing consider_parallel_rechecking_params and
parallel_safe_ignoring_params allows us to keep more context on params
and make a more nuanced decision at the proper level of the plan. This
is what I mean by "rechecked in the using context", though I realize
now that both "recheck" and "context" are overloaded terms in the
project, so don't describe the concept particularly clearly. When a
path relies on params we can only make a final determination about its
parallel safety if we know whether or not the current parallel node
can provide the param's value. We don't necessarily know that
information until we attempt to generate a full parallel node in the
plan (I think what you're describing as "inserting a Gather node")
since the param may come from another node in the plan. These new
values allow us to do that by tracking tentatively parallel-safe
subplans (given proper Gather node placement) and delaying the
parallel-safety determination until the point at which a param is
available (or not).

Is that a more helpful framing of what my goal is here?

When I last worked on this, I had hoped that extParam or allParam
would be the thing that would answer the question: are there any
parameters used under this node that are not also set under this node?
But I seem to recall that neither seemed to be answering precisely
that question, and the lousy naming of those fields and limited
documentation of their intended purpose did not help.

I don't really know anything about extParam or allParam, so I can't
offer any insight here.

Thanks,
James Coleman

#12

Robert Haas

robertmhaas@gmail.com

over 4 years ago

In reply to: James Coleman (#11)

Re: Parallelize correlated subqueries that execute within each worker

On Wed, Nov 3, 2021 at 1:34 PM James Coleman <jtc331@gmail.com> wrote:

As I understand the current code, parallel plans are largely chosen
based not on where it's safe to insert a Gather node but rather by
determining if a given path is parallel safe. Through that lens params
are a bit of an odd man out -- they aren't inherently unsafe in the
way a parallel-unsafe function is, but they can only be used in
parallel plans under certain conditions (whether because of project
policy, performance, or missing infrastructure).

Right.

Introducing consider_parallel_rechecking_params and
parallel_safe_ignoring_params allows us to keep more context on params
and make a more nuanced decision at the proper level of the plan. This
is what I mean by "rechecked in the using context", though I realize
now that both "recheck" and "context" are overloaded terms in the
project, so don't describe the concept particularly clearly. When a
path relies on params we can only make a final determination about its
parallel safety if we know whether or not the current parallel node
can provide the param's value. We don't necessarily know that
information until we attempt to generate a full parallel node in the
plan (I think what you're describing as "inserting a Gather node")
since the param may come from another node in the plan. These new
values allow us to do that by tracking tentatively parallel-safe
subplans (given proper Gather node placement) and delaying the
parallel-safety determination until the point at which a param is
available (or not).

So I think I agree with you here. But I don't like all of this
"ignoring_params" stuff and I don't see why it's necessary. Say we
don't have both parallel_safe and parallel_safe_ignoring_params. Say
we just have parallel_safe. If the plan will be parallel safe if the
params are available, we label it parallel safe. If the plan will not
be parallel safe even if the params are available, we say it's not
parallel safe. Then, when we get to generate_gather_paths(), we don't
generate any paths if there are required parameters that are not
available. What's wrong with that approach?

Maybe it's clearer to say this: I feel like one extra Boolean is
either too much or too little. I think maybe it's not even needed. But
if it is needed, then why just a bool instead of, say, a Bitmapset of
params that are needed, or something?

I'm sort of speaking from intuition here rather than sure knowledge. I
might be totally wrong.

--
Robert Haas
EDB: http://www.enterprisedb.com

#13

Michael Paquier

michael@paquier.xyz

over 4 years ago

In reply to: Robert Haas (#12)

Re: Parallelize correlated subqueries that execute within each worker

On Mon, Nov 15, 2021 at 10:01:37AM -0500, Robert Haas wrote:

On Wed, Nov 3, 2021 at 1:34 PM James Coleman <jtc331@gmail.com> wrote:

As I understand the current code, parallel plans are largely chosen
based not on where it's safe to insert a Gather node but rather by
determining if a given path is parallel safe. Through that lens params
are a bit of an odd man out -- they aren't inherently unsafe in the
way a parallel-unsafe function is, but they can only be used in
parallel plans under certain conditions (whether because of project
policy, performance, or missing infrastructure).

Right.

Please note that the CF bot is complaining here, so I have moved this
patch to the next CF, but changed the status as waiting on author.
--
Michael

#14

James Coleman

jtc331@gmail.com

over 4 years ago

In reply to: Michael Paquier (#13)

Re: Parallelize correlated subqueries that execute within each worker

On Fri, Dec 3, 2021 at 2:35 AM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Nov 15, 2021 at 10:01:37AM -0500, Robert Haas wrote:

On Wed, Nov 3, 2021 at 1:34 PM James Coleman <jtc331@gmail.com> wrote:

As I understand the current code, parallel plans are largely chosen
based not on where it's safe to insert a Gather node but rather by
determining if a given path is parallel safe. Through that lens params
are a bit of an odd man out -- they aren't inherently unsafe in the
way a parallel-unsafe function is, but they can only be used in
parallel plans under certain conditions (whether because of project
policy, performance, or missing infrastructure).

Right.

Please note that the CF bot is complaining here, so I have moved this
patch to the next CF, but changed the status as waiting on author.

I rebased this back in December, but somehow forgot to reply with the
updated patch, so, here it is finally.

Thanks,
James Coleman

#15

James Coleman

jtc331@gmail.com

over 4 years ago

In reply to: Robert Haas (#12)

Re: Parallelize correlated subqueries that execute within each worker

On Mon, Nov 15, 2021 at 10:01 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Nov 3, 2021 at 1:34 PM James Coleman <jtc331@gmail.com> wrote:

As I understand the current code, parallel plans are largely chosen
based not on where it's safe to insert a Gather node but rather by
determining if a given path is parallel safe. Through that lens params
are a bit of an odd man out -- they aren't inherently unsafe in the
way a parallel-unsafe function is, but they can only be used in
parallel plans under certain conditions (whether because of project
policy, performance, or missing infrastructure).

Right.

Introducing consider_parallel_rechecking_params and
parallel_safe_ignoring_params allows us to keep more context on params
and make a more nuanced decision at the proper level of the plan. This
is what I mean by "rechecked in the using context", though I realize
now that both "recheck" and "context" are overloaded terms in the
project, so don't describe the concept particularly clearly. When a
path relies on params we can only make a final determination about its
parallel safety if we know whether or not the current parallel node
can provide the param's value. We don't necessarily know that
information until we attempt to generate a full parallel node in the
plan (I think what you're describing as "inserting a Gather node")
since the param may come from another node in the plan. These new
values allow us to do that by tracking tentatively parallel-safe
subplans (given proper Gather node placement) and delaying the
parallel-safety determination until the point at which a param is
available (or not).

So I think I agree with you here. But I don't like all of this
"ignoring_params" stuff and I don't see why it's necessary. Say we
don't have both parallel_safe and parallel_safe_ignoring_params. Say
we just have parallel_safe. If the plan will be parallel safe if the
params are available, we label it parallel safe. If the plan will not
be parallel safe even if the params are available, we say it's not
parallel safe. Then, when we get to generate_gather_paths(), we don't
generate any paths if there are required parameters that are not
available. What's wrong with that approach?

Maybe it's clearer to say this: I feel like one extra Boolean is
either too much or too little. I think maybe it's not even needed. But
if it is needed, then why just a bool instead of, say, a Bitmapset of
params that are needed, or something?

I'm sort of speaking from intuition here rather than sure knowledge. I
might be totally wrong.

Apologies for quite the delay responding to this.

I've been chewing on this a bit, and I was about to go re-read the
code and see how easy it'd be to do exactly what you're suggesting in
generate_gather_paths() (and verifying it doesn't need to happen in
other places). However there's one (I think large) gotcha with that
approach (assuming it otherwise makes sense): it means we do
unnecessary work. In the current patch series we only need to recheck
parallel safety if we're in a situation where we might actually
benefit from doing that work (namely when we have a correlated
subquery we might otherwise be able to execute in a parallel plan). If
we don't track that status we'd have to recheck the full parallel
safety of the path for all paths -- even without correlated
subqueries.

Alternatively we could merge these fields into a single enum field
that tracked these states. Even better, we could use a bitmap to
signify what items are/aren't parallel safe. I'm not sure if that'd
create even larger churn in the patch, but maybe it's worth it either
way. In theory it'd open up further expansions to this concept later
(though I'm not aware of any such ideas).

If you think such an approach would be an improvement I'd be happy to
take a pass at a revised patch.

Thoughts?

Thanks,
James Coleman

#16

Robert Haas

robertmhaas@gmail.com

over 4 years ago

In reply to: James Coleman (#15)

Re: Parallelize correlated subqueries that execute within each worker

On Fri, Jan 14, 2022 at 2:25 PM James Coleman <jtc331@gmail.com> wrote:

I've been chewing on this a bit, and I was about to go re-read the
code and see how easy it'd be to do exactly what you're suggesting in
generate_gather_paths() (and verifying it doesn't need to happen in
other places). However there's one (I think large) gotcha with that
approach (assuming it otherwise makes sense): it means we do
unnecessary work. In the current patch series we only need to recheck
parallel safety if we're in a situation where we might actually
benefit from doing that work (namely when we have a correlated
subquery we might otherwise be able to execute in a parallel plan). If
we don't track that status we'd have to recheck the full parallel
safety of the path for all paths -- even without correlated
subqueries.

I don't think there's an intrinsic problem with the idea of making a
tentative determination about parallel safety and then refining it
later, but I'm not sure why you think it would be a lot of work to
figure this out at the point where we generate gather paths. I think
it's just a matter of testing whether the set of parameters that the
path needs as input is the empty set. It may be that neither extParam
nor allParam are precisely that thing, but I think both are very
close, and it seems to me that there's no theoretical reason why we
can't know for every path the set of inputs that it requires "from the
outside."

--
Robert Haas
EDB: http://www.enterprisedb.com

#17

Tom Lane

tgl@sss.pgh.pa.us

over 4 years ago

In reply to: Robert Haas (#16)

Re: Parallelize correlated subqueries that execute within each worker

Robert Haas <robertmhaas@gmail.com> writes:

I don't think there's an intrinsic problem with the idea of making a
tentative determination about parallel safety and then refining it
later, but I'm not sure why you think it would be a lot of work to
figure this out at the point where we generate gather paths. I think
it's just a matter of testing whether the set of parameters that the
path needs as input is the empty set. It may be that neither extParam
nor allParam are precisely that thing, but I think both are very
close, and it seems to me that there's no theoretical reason why we
can't know for every path the set of inputs that it requires "from the
outside."

I'd be very happy if someone redesigned the extParam/allParam mechanism,
or at least documented it better. It's confusing and I've never been
able to escape the feeling that it's somewhat redundant.

The real problem with it though is that we don't compute those values
until much too late to be useful in path construction; see comments
for SS_identify_outer_params. To be helpful to the planner, we'd have
to rejigger things at least enough to calculate them earlier -- or
maybe better, calculate what the planner wants earlier, and then transform
to what the executor wants later.

regards, tom lane

#18

James Coleman

jtc331@gmail.com

over 4 years ago

In reply to: Robert Haas (#16)

Re: Parallelize correlated subqueries that execute within each worker

On Fri, Jan 21, 2022 at 3:20 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jan 14, 2022 at 2:25 PM James Coleman <jtc331@gmail.com> wrote:

I've been chewing on this a bit, and I was about to go re-read the
code and see how easy it'd be to do exactly what you're suggesting in
generate_gather_paths() (and verifying it doesn't need to happen in
other places). However there's one (I think large) gotcha with that
approach (assuming it otherwise makes sense): it means we do
unnecessary work. In the current patch series we only need to recheck
parallel safety if we're in a situation where we might actually
benefit from doing that work (namely when we have a correlated
subquery we might otherwise be able to execute in a parallel plan). If
we don't track that status we'd have to recheck the full parallel
safety of the path for all paths -- even without correlated
subqueries.

I don't think there's an intrinsic problem with the idea of making a
tentative determination about parallel safety and then refining it
later, but I'm not sure why you think it would be a lot of work to
figure this out at the point where we generate gather paths. I think
it's just a matter of testing whether the set of parameters that the
path needs as input is the empty set. It may be that neither extParam
nor allParam are precisely that thing, but I think both are very
close, and it seems to me that there's no theoretical reason why we
can't know for every path the set of inputs that it requires "from the
outside."

As I understand it now (not sure I realized this before) you're
suggesting that *even when there are required params* marking it as
parallel safe, and then checking the params for parallel safety later.
From a purely theoretical perspective that seemed reasonable, so I
took a pass at that approach.

The first, and likely most interesting, thing I discovered was that
the vast majority of what the patch accomplishes it does so not via
the delayed params safety checking but rather via the required outer
relids checks I'd added to generate_useful_gather_paths.

For that to happen I did have to mark PARAM_EXEC params as presumed
parallel safe. That means that parallel_safe now doesn't strictly mean
"parallel safe in the current context" but "parallel safe as long as
any params are provided". That's a real change, but probably
acceptable as long as a project policy decision is made in that
direction.

There are a few concerns I have (and I'm not sure what level they rise to):

1. From what I can tell we don't have access on a path to the set of
params required by that path (I believe this is what Tom was
referencing in his sister reply at this point in the thread). That
means we have to rely on checking that the required outer relids are
provided by the current query level. I'm not quite sure yet whether or
not that guarantees (or if the rest of the path construction logic
guarantees for us) that the params provided by the outer rel are used
in a correlated way that isn't shared across workers. And because we
don't have the param information available we can't add additional
checks (that I can tell) to verify that.
2. Are we excluding any paths (by having one that will always be
invalid win the cost comparisons in add_partial_path)? I suppose this
danger actually exists in the previous version of the patch as well,
and I don't actually have any examples of this being a problem. Also
maybe this can only be a problem if (1) reveals a bug.
3. The new patch series actually ends up allowing parallelization of
correlated params in a few more places than the original patch series.
From what I can tell all of the cases are in fact safe to execute in
parallel, which, if true, means this is a feature not a concern. The
changed query plans fall into two categories: a.) putting a gather
inside a subplan and b.) correlated param usages in a subquery scan
path on the inner side of a join. I've separated out those specific
changes in a separate patch to make it easier to tell which ones I'm
referencing.

On the other hand this is a dramatically simpler patch series.
Assuming the approach is sound, it should much easier to maintain than
the previous version.

The final patch in the series is a set of additional checks I could
imagine to try to be more explicit, but at least in the current test
suite there isn't anything at all they affect.

Does this look at least somewhat more like what you'd envisionsed
(granting the need to squint hard given the relids checks instead of
directly checking params)?

Thanks,
James Coleman

#19

Andres Freund

andres@anarazel.de

about 4 years ago

In reply to: James Coleman (#18)

Re: Parallelize correlated subqueries that execute within each worker

Hi,

On 2022-01-22 20:25:19 -0500, James Coleman wrote:

On the other hand this is a dramatically simpler patch series.
Assuming the approach is sound, it should much easier to maintain than
the previous version.

The final patch in the series is a set of additional checks I could
imagine to try to be more explicit, but at least in the current test
suite there isn't anything at all they affect.

Does this look at least somewhat more like what you'd envisionsed
(granting the need to squint hard given the relids checks instead of
directly checking params)?

This fails on freebsd (so likely a timing issue): https://cirrus-ci.com/task/4758411492458496?logs=test_world#L2225

Marked as waiting on author.

Greetings,

Andres Freund

#20

Jacob Champion

jacob.champion@enterprisedb.com

almost 4 years ago

In reply to: Andres Freund (#19)

Re: Parallelize correlated subqueries that execute within each worker

This entry has been waiting on author input for a while (our current
threshold is roughly two weeks), so I've marked it Returned with
Feedback.

Once you think the patchset is ready for review again, you (or any
interested party) can resurrect the patch entry by visiting

https://commitfest.postgresql.org/38/3246/

and changing the status to "Needs Review", and then changing the
status again to "Move to next CF". (Don't forget the second step;
hopefully we will have streamlined this in the near future!)

Thanks,
--Jacob

#21

James Coleman

jtc331@gmail.com

over 3 years ago

In reply to: Andres Freund (#19)

#22