initial pruning in parallel append
Hi,
In an off-list chat, Robert suggested that it might be a good idea to
look more closely into $subject, especially in the context of the
project of moving the locking of child tables / partitions to the
ExecInitNode() phase when executing cached generic plans [1].
Robert's point is that a worker's output of initial pruning which
consists of the set of child subplans (of a parallel-aware Append or
MergeAppend) it considers as valid for execution may not be the same
as the leader's and that of other workers. If that does indeed
happen, it may confuse the Append's parallel-execution code, possibly
even cause crashes, because the ParallelAppendState set up by the
leader assumes a certain number and identity (?) of
valid-for-execution subplans.
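For reference, the shared state in question looks like this (the actual
struct from nodeAppend.c); pa_next_plan and pa_finished[] are indexed by a
subplan's position in the node's set of valid-for-execution subplans, so a
leader and a worker that disagree on that set would read and write
different slots:

typedef struct ParallelAppendState
{
    LWLock      pa_lock;        /* mutual exclusion to choose next subplan */
    int         pa_next_plan;   /* next plan to choose by any worker */

    /*
     * pa_finished[i] should be true if no more workers should select subplan
     * i.  For a non-partial plan, this is set to true as soon as a worker
     * selects the plan; for a partial plan, it remains false until some
     * worker executes the plan to completion.
     */
    bool        pa_finished[FLEXIBLE_ARRAY_MEMBER];
} ParallelAppendState;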
So he suggests that initial pruning should only be done once in the
leader and the result of that put in the EState for
ExecInitParallelPlan() to serialize to pass down to workers. Workers
would simply consume that as-is to set the valid-for-execution child
subplans in their copy of the AppendState, instead of doing the initial
pruning again. Actually, earlier patches at [1] had implemented that
mechanism (remembering the result of initial pruning and using it at a
later time and place), because the earlier design there was to move
the initial pruning on the nodes in a cached generic plan tree from
ExecInitNode() to GetCachedPlan(). The result of initial pruning done
in the latter would be passed down to and consumed in the former using
what was called PartitionPruneResult nodes.
Maybe that stuff could be resurrected, though I was wondering if the
risk of the same initial pruning steps returning different results
when performed repeatedly in *one query lifetime* isn't pretty
minimal or maybe rather non-existent? I think that's because
performing initial pruning steps entails computing constant and/or
stable expressions and comparing them with an unchanging set of
partition bound values, with comparison functions whose result is also
presumed to be stable. Then there's also the step of mapping the
partition indexes as they appear in the PartitionDesc to the indexes
of their subplans under Append/MergeAppend using the information
contained in PartitionPruneInfo (subplan_map) and the result of
mapping should be immutable too.
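To illustrate, the mapping step essentially boils down to this fixed lookup
(a simplified sketch of find_matching_subplans_recurse() in
execPartition.c; the real code also recurses into sub-partitions via
subpart_map):

static void
map_partitions_to_subplans(PartitionedRelPruningData *pprune,
                           Bitmapset *partset, Bitmapset **validsubplans)
{
    int         i = -1;

    /* translate surviving partition indexes into Append subplan indexes */
    while ((i = bms_next_member(partset, i)) >= 0)
    {
        if (pprune->subplan_map[i] >= 0)
            *validsubplans = bms_add_member(*validsubplans,
                                            pprune->subplan_map[i]);
    }
}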
I considered that the comparison functions that
match_clause_to_partition_key() obtains by calling get_opfamily_proc()
may in fact not be stable, though that doesn't seem to be a worry at
least with the out-of-the-box pg_amproc collection:
select amproc, p.provolatile
from pg_amproc, pg_proc p
where amproc = p.oid and p.provolatile <> 'i';

          amproc           | provolatile
---------------------------+-------------
 date_cmp_timestamptz      | s
 timestamp_cmp_timestamptz | s
 timestamptz_cmp_date      | s
 timestamptz_cmp_timestamp | s
 pg_catalog.in_range       | s
(5 rows)
Is it possible for a user to add a volatile procedure to pg_amproc?
If that's possible, match_clause_to_partition_key() may pick one as a
comparison function for pruning, because it doesn't actually check the
procedure's provolatile before doing so. I'd hope not, though I would
like to be sure of that to support what I wrote above.
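As far as I can tell, CREATE OPERATOR CLASS does not check the volatility
of the support functions it is given, so a superuser could in principle do
something like this (hypothetical sketch; my_volatile_cmp and my_int4_ops
are made-up names):

-- a btree comparison support function deliberately marked volatile
CREATE FUNCTION my_volatile_cmp(int4, int4) RETURNS int4
    AS 'SELECT btint4cmp($1, $2)' LANGUAGE sql VOLATILE;

-- register it as support function 1 of a user-defined opclass
CREATE OPERATOR CLASS my_int4_ops FOR TYPE int4 USING btree AS
    OPERATOR 1 <, OPERATOR 2 <=, OPERATOR 3 =,
    OPERATOR 4 >=, OPERATOR 5 >,
    FUNCTION 1 my_volatile_cmp(int4, int4);

A table declared with PARTITION BY RANGE (col my_int4_ops) would then be
pruned using a comparison function whose results need not be stable.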
[1] https://commitfest.postgresql.org/43/3478/

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com
On Tue, Jun 27, 2023 at 9:23 AM Amit Langote <amitlangote09@gmail.com> wrote:
> Maybe that stuff could be resurrected, though I was wondering if the
> risk of the same initial pruning steps returning different results
> when performed repeatedly in *one query lifetime* isn't pretty
> minimal or maybe rather non-existent? I think that's because
> performing initial pruning steps entails computing constant and/or
> stable expressions and comparing them with an unchanging set of
> partition bound values, with comparison functions whose result is also
> presumed to be stable. Then there's also the step of mapping the
> partition indexes as they appear in the PartitionDesc to the indexes
> of their subplans under Append/MergeAppend using the information
> contained in PartitionPruneInfo (subplan_map) and the result of
> mapping should be immutable too.
>
> I considered that the comparison functions that
> match_clause_to_partition_key() obtains by calling get_opfamily_proc()
> may in fact not be stable, though that doesn't seem to be a worry at
> least with the out-of-the-box pg_amproc collection:
I think it could be acceptable if a stable function not actually being
stable results in some kind of internal error message, hopefully one
that in some way hints to the user what the problem was. But crashing
because some expression was supposed to be stable and wasn't is a
bridge too far, especially if, as I think would be the case here, the
crash happens in a part of the code that is far removed from where the
problem was introduced.
The real issue here is about how much trust you can place in a given
invariant. If, in a single function, we initialize a value to 0 and
thereafter only ever increment it, we can logically reason that if we
ever see a value less than zero, there must have been an overflow. It
is true that if our function calls some other function, that other
function could access data through a garbage pointer and possibly
corrupt the value of our function's local variable, but that's
extremely unlikely, and we can basically decide that we're not going
to care about it, because such code is likely to crash anyway before
too long.
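To make that concrete, such an invariant looks like this in code, with
do_one_unit_of_work() a made-up stand-in for the function's real work:

int64       nprocessed = 0;

while (do_one_unit_of_work())
{
    nprocessed++;               /* initialized to 0 and only ever incremented */
    if (nprocessed < 0)         /* so a negative value can only mean overflow */
        elog(ERROR, "counter overflowed");
}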
But now consider an invariant that implicates a larger amount of code
e.g. you must always hold a buffer pin before accessing the buffer
contents. In many cases, it's fairly easy to verify that this must be
so in any given piece of code, but there are problems: some code that
does buffer access is complicated enough that it's hard to fully
verify, especially when buffer pins are held across long periods, and
what is probably worse, there are tons of different places in the code
that access buffers. Hence, we've had bugs in this area, and likely
will have bugs in this area again. In theory, with a sufficient amount
of really careful work, you can find all of the problems, but in
practice it's pretty difficult. Nonetheless, we just have to
accept the risk that we're going to crash if a bug in this area does
exist, because there's no real way to cope with the contents of the
buffer that you're accessing being swapped out while you're in the
middle of looking at it, or even modifying it.
But the present case is different in a couple of ways. First, there's
probably even more code involved, including a good bit of it that's
not in core but is user-defined. Second, we've generally made a
decision up until now that we don't want to have a hard dependency on
stable functions actually being stable. If they aren't, and for
example you're using index expressions, your queries may return wrong
answers, but you won't get weird internal error messages, and you
won't get a crash. I think the bar for this feature is the same.
--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
> ... Second, we've generally made a
> decision up until now that we don't want to have a hard dependency on
> stable functions actually being stable. If they aren't, and for
> example you're using index expressions, your queries may return wrong
> answers, but you won't get weird internal error messages, and you
> won't get a crash. I think the bar for this feature is the same.
Yeah, I agree --- wrong answers may be acceptable in such a case, but
crashes or unintelligible error messages aren't. There are practical
reasons for being tolerant here, notably that it's not that easy
for users to get their volatility markings right.
In the case at hand, I think that means that allowing workers to do
pruning would require hardening the parallel append code against the
situation where their pruning results vary. Maybe, instead of passing
the pruning results *down*, we could pass them *up* to the leader and
then throw an error if they're different?
regards, tom lane
On Mon, Aug 7, 2023 at 22:29 Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> ... Second, we've generally made a
>> decision up until now that we don't want to have a hard dependency on
>> stable functions actually being stable. If they aren't, and for
>> example you're using index expressions, your queries may return wrong
>> answers, but you won't get weird internal error messages, and you
>> won't get a crash. I think the bar for this feature is the same.
>
> Yeah, I agree --- wrong answers may be acceptable in such a case, but
> crashes or unintelligible error messages aren't. There are practical
> reasons for being tolerant here, notably that it's not that easy
> for users to get their volatility markings right.
>
> In the case at hand, I think that means that allowing workers to do
> pruning would require hardening the parallel append code against the
> situation where their pruning results vary. Maybe, instead of passing
> the pruning results *down*, we could pass them *up* to the leader and
> then throw an error if they're different?
Note we’re talking here about “initial” pruning that occurs during
ExecInitNode(). Workers are only launched during ExecGather[Merge](), and
they thereafter do ExecInitNode() on their copy of the plan tree. So if we
are to pass the pruning results for cross-checking, it will have to be from
the leader to workers.
--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com
On Mon, Aug 7, 2023 at 10:25 AM Amit Langote <amitlangote09@gmail.com> wrote:
> Note we’re talking here about “initial” pruning that occurs during ExecInitNode(). Workers are only launched during ExecGather[Merge](), and they thereafter do ExecInitNode() on their copy of the plan tree. So if we are to pass the pruning results for cross-checking, it will have to be from the leader to workers.
That doesn't seem like a big problem because there aren't many node
types that do pruning, right? I think we're just talking about Append
and MergeAppend, or something like that, right? You just need the
ExecWhateverEstimate function to budget some DSM space to store the
information, which can basically just be a bitmap over the set of
child plans, and the ExecWhateverInitializeDSM copies the information
into that DSM space, and ExecWhateverInitializeWorker() copies the
information from the shared space back into the local node (or maybe
just points to it, if the representation is sufficiently compatible).
I feel like this is an hour or two's worth of coding, unless I'm
missing something, and WAY easier than trying to reason about what
happens if expression evaluation isn't as stable as we'd like it to
be.
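Concretely, that pattern might look like the following rough sketch,
reusing the existing parallel-Append hooks in nodeAppend.c. The bodies
shown replace rather than extend what those functions actually do today,
and as_valid_map is an invented AppendState field holding one bool per
child plan:

void
ExecAppendEstimate(AppendState *node, ParallelContext *pcxt)
{
    /* budget one validity flag per child plan */
    shm_toc_estimate_chunk(&pcxt->estimator,
                           mul_size(sizeof(bool), node->as_nplans));
    shm_toc_estimate_keys(&pcxt->estimator, 1);
}

void
ExecAppendInitializeDSM(AppendState *node, ParallelContext *pcxt)
{
    bool       *validmap;

    /* leader publishes its initial-pruning verdict for each child */
    validmap = shm_toc_allocate(pcxt->toc,
                                mul_size(sizeof(bool), node->as_nplans));
    memcpy(validmap, node->as_valid_map, sizeof(bool) * node->as_nplans);
    shm_toc_insert(pcxt->toc, node->ps.plan->plan_node_id, validmap);
}

void
ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
{
    /* worker adopts the leader's verdict instead of re-pruning */
    node->as_valid_map = shm_toc_lookup(pwcxt->toc,
                                        node->ps.plan->plan_node_id, false);
}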
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Aug 8, 2023 at 12:53 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Aug 7, 2023 at 10:25 AM Amit Langote <amitlangote09@gmail.com> wrote:
>> Note we’re talking here about “initial” pruning that occurs during ExecInitNode(). Workers are only launched during ExecGather[Merge](), and they thereafter do ExecInitNode() on their copy of the plan tree. So if we are to pass the pruning results for cross-checking, it will have to be from the leader to workers.
>
> That doesn't seem like a big problem because there aren't many node
> types that do pruning, right? I think we're just talking about Append
> and MergeAppend, or something like that, right?
MergeAppend can't be parallel-aware atm, so only Append.
> You just need the
> ExecWhateverEstimate function to budget some DSM space to store the
> information, which can basically just be a bitmap over the set of
> child plans, and the ExecWhateverInitializeDSM copies the information
> into that DSM space, and ExecWhateverInitializeWorker() copies the
> information from the shared space back into the local node (or maybe
> just points to it, if the representation is sufficiently compatible).
>
> I feel like this is an hour or two's worth of coding, unless I'm
> missing something, and WAY easier than trying to reason about what
> happens if expression evaluation isn't as stable as we'd like it to
> be.
OK, I agree that we'd better share the pruning result between the
leader and workers.
I hadn't thought about putting the pruning result into Append's DSM
(ParallelAppendState), which is what you're describing IIUC. I looked
into it, though I'm not sure if it can be made to work given the way
things are on the worker side, or at least not without some
reshuffling of code in ParallelQueryMain(). The pruning result will
have to be available in ExecInitAppend(), but because the worker reads
the DSM only after finishing the plan tree initialization, it won't be.
Perhaps, we can integrate ExecParallelInitializeWorker()'s
responsibilities into ExecutorStart() / ExecInitNode() somehow?
So change the ordering of the following code in ParallelQueryMain():
/* Start up the executor */
queryDesc->plannedstmt->jitFlags = fpes->jit_flags;
ExecutorStart(queryDesc, fpes->eflags);

/* Special executor initialization steps for parallel workers */
queryDesc->planstate->state->es_query_dsa = area;
if (DsaPointerIsValid(fpes->param_exec))
{
    char       *paramexec_space;

    paramexec_space = dsa_get_address(area, fpes->param_exec);
    RestoreParamExecParams(paramexec_space, queryDesc->estate);
}
pwcxt.toc = toc;
pwcxt.seg = seg;
ExecParallelInitializeWorker(queryDesc->planstate, &pwcxt);
Looking inside ExecParallelInitializeWorker():
static bool
ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
{
    if (planstate == NULL)
        return false;

    switch (nodeTag(planstate))
    {
        case T_SeqScanState:
            if (planstate->plan->parallel_aware)
                ExecSeqScanInitializeWorker((SeqScanState *) planstate,
                                            pwcxt);
I guess that'd mean putting the if (planstate->plan->parallel_aware)
block seen here at the end of ExecInitSeqScan() and so on.
Or we could consider something like the patch I mentioned in my 1st
email. The idea there was to pass the pruning result via a separate
channel, not the DSM chunk linked into the PlanState tree. To wit, on
the leader side, ExecInitParallelPlan() puts the serialized
List-of-Bitmapset into the shm_toc with a dedicated PARALLEL_KEY,
alongside PlannedStmt, ParamListInfo, etc. The List-of-Bitmapset is
initialized during the leader's ExecInitNode(). On the worker side,
ExecParallelGetQueryDesc() reads the List-of-Bitmapset string and puts
the resulting node into the QueryDesc, which ParallelQueryMain() then
uses to do ExecutorStart(), which copies the pointer to
EState.es_part_prune_results. ExecInitAppend() consults
EState.es_part_prune_results and uses the Bitmapset from there, if
present, instead of performing initial pruning. I'm assuming it's not
too ugly if ExecInitAppend() uses IsParallelWorker() to decide whether
it should be writing to EState.es_part_prune_results or reading from
it -- the former if in the leader and the latter in a worker. If we
are to go with this approach we will need to un-revert ec386948948c,
which moved PartitionPruneInfo nodes out of Append/MergeAppend nodes
to a List in PlannedStmt (copied into EState.es_part_prune_infos),
such that es_part_prune_results mirrors es_part_prune_infos.
Thoughts?
--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com
On Tue, Aug 8, 2023 at 2:58 AM Amit Langote <amitlangote09@gmail.com> wrote:
>> That doesn't seem like a big problem because there aren't many node
>> types that do pruning, right? I think we're just talking about Append
>> and MergeAppend, or something like that, right?
>
> MergeAppend can't be parallel-aware atm, so only Append.
Well, the question isn't whether it's parallel-aware, but whether
startup-time pruning happens there.
> So change the ordering of the following code in ParallelQueryMain():
Yeah, that would be a reasonable thing to do.
> Or we could consider something like the patch I mentioned in my 1st
> email. The idea there was to pass the pruning result via a separate
> channel, not the DSM chunk linked into the PlanState tree. To wit, on
> the leader side, ExecInitParallelPlan() puts the serialized
> List-of-Bitmapset into the shm_toc with a dedicated PARALLEL_KEY,
> alongside PlannedStmt, ParamListInfo, etc. The List-of-Bitmapset is
> initialized during the leader's ExecInitNode(). On the worker side,
> ExecParallelGetQueryDesc() reads the List-of-Bitmapset string and puts
> the resulting node into the QueryDesc, which ParallelQueryMain() then
> uses to do ExecutorStart(), which copies the pointer to
> EState.es_part_prune_results. ExecInitAppend() consults
> EState.es_part_prune_results and uses the Bitmapset from there, if
> present, instead of performing initial pruning.
This also seems reasonable.
> I'm assuming it's not
> too ugly if ExecInitAppend() uses IsParallelWorker() to decide whether
> it should be writing to EState.es_part_prune_results or reading from
> it -- the former if in the leader and the latter in a worker.
I don't think that's too ugly. I mean you have to have an if statement
someplace.
> If we
> are to go with this approach we will need to un-revert ec386948948c,
> which moved PartitionPruneInfo nodes out of Append/MergeAppend nodes
> to a List in PlannedStmt (copied into EState.es_part_prune_infos),
> such that es_part_prune_results mirrors es_part_prune_infos.
The comment for the revert (which was
5472743d9e8583638a897b47558066167cc14583) points to
https://postgr.es/m/4191508.1674157166@sss.pgh.pa.us
as the reason, but it's not very clear to me why that email led to
this being reverted. In any event, I agree that if we go with your
idea to pass this via a separate PARALLEL_KEY, unreverting that patch
seems to make sense, because otherwise I think we don't have a fast
way to find the nodes that contain the state that we care about.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Aug 8, 2023 at 11:16 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Aug 8, 2023 at 2:58 AM Amit Langote <amitlangote09@gmail.com> wrote:
>> Or we could consider something like the patch I mentioned in my 1st
>> email. The idea there was to pass the pruning result via a separate
>> channel, not the DSM chunk linked into the PlanState tree. To wit, on
>> the leader side, ExecInitParallelPlan() puts the serialized
>> List-of-Bitmapset into the shm_toc with a dedicated PARALLEL_KEY,
>> alongside PlannedStmt, ParamListInfo, etc. The List-of-Bitmapset is
>> initialized during the leader's ExecInitNode(). On the worker side,
>> ExecParallelGetQueryDesc() reads the List-of-Bitmapset string and puts
>> the resulting node into the QueryDesc, which ParallelQueryMain() then
>> uses to do ExecutorStart(), which copies the pointer to
>> EState.es_part_prune_results. ExecInitAppend() consults
>> EState.es_part_prune_results and uses the Bitmapset from there, if
>> present, instead of performing initial pruning.
>
> This also seems reasonable.
>
>> I'm assuming it's not
>> too ugly if ExecInitAppend() uses IsParallelWorker() to decide whether
>> it should be writing to EState.es_part_prune_results or reading from
>> it -- the former if in the leader and the latter in a worker.
>
> I don't think that's too ugly. I mean you have to have an if statement
> someplace.
Yes, that makes sense.
It's just that I thought maybe I haven't thought hard enough about
options before adding a new IsParallelWorker(), because I don't find
too many instances of IsParallelWorker() in the generic executor code.
I think that's because most parallel worker-specific logic lives in
execParallel.c or in Exec*Worker() functions outside that file, so the
generic code remains parallel query agnostic as much as possible.
>> If we
>> are to go with this approach we will need to un-revert ec386948948c,
>> which moved PartitionPruneInfo nodes out of Append/MergeAppend nodes
>> to a List in PlannedStmt (copied into EState.es_part_prune_infos),
>> such that es_part_prune_results mirrors es_part_prune_infos.
>
> The comment for the revert (which was
> 5472743d9e8583638a897b47558066167cc14583) points to
> https://postgr.es/m/4191508.1674157166@sss.pgh.pa.us
> as the reason, but it's not very clear to me why that email led to
> this being reverted. In any event, I agree that if we go with your
> idea to pass this via a separate PARALLEL_KEY, unreverting that patch
> seems to make sense, because otherwise I think we don't have a fast
> way to find the nodes that contain the state that we care about.
OK, I've attached the unreverted patch that adds
EState.es_part_prune_infos as 0001.
0002 adds EState.es_part_prune_results. The parallel query leader
performs the initial pruning steps contained in a given
PartitionPruneInfo and stores the resulting bitmapset of initially
valid subplans in that list, at the same index as the
PartitionPruneInfo's index in es_part_prune_infos.
ExecInitParallelPlan() serializes es_part_prune_results and stores it
in the DSM. A worker initializes es_part_prune_results in its own
EState by reading the leader's value from the DSM, and, for each
PartitionPruneInfo in its own copy of EState.es_part_prune_infos, gets
the set of initially valid subplans by referring to
es_part_prune_results in lieu of performing initial pruning again.
Should workers, as Tom says, instead do the pruning and cross-check
the result to give an error if it doesn't match the leader's? The
error message can't specifically point out which function is at fault,
though a user would at least know that they have functions in their
database with wrong volatility markings.
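For instance, the worker-side branch of ExecInitPartitionPruning() could
look something like this (only a sketch; the error wording is a strawman):

if (IsParallelWorker())
{
    /* re-run initial pruning and cross-check against the leader's result */
    Bitmapset  *leader_valid = list_nth(estate->es_part_prune_results,
                                        part_prune_index);
    Bitmapset  *my_valid = ExecFindMatchingSubPlans(prunestate, true);

    if (!bms_equal(leader_valid, my_valid))
        ereport(ERROR,
                (errcode(ERRCODE_INTERNAL_ERROR),
                 errmsg("initial partition pruning result differs between parallel leader and worker"),
                 errhint("This may be caused by pruning expressions or comparison functions that are marked stable but are not actually stable.")));
    *initially_valid_subplans = my_valid;
}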
--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com
Attachments:
v1-0001-Move-PartitioPruneInfo-out-of-plan-nodes-into-Pla.patch
From 2397e2ada9b6f41e7e3c2b8f70c191b878bee1f8 Mon Sep 17 00:00:00 2001
From: Amit Langote <amitlan@postgresql.org>
Date: Wed, 9 Aug 2023 18:24:11 +0900
Subject: [PATCH v1 1/2] Move PartitionPruneInfo out of plan nodes into
 PlannedStmt

The planner will now add a given PartitionPruneInfo to
PlannedStmt.partPruneInfos instead of directly to the
Append/MergeAppend plan node. What gets set instead in the
latter is an index field which points to the list element
of PlannedStmt.partPruneInfos containing the PartitionPruneInfo
belonging to the plan node.
---
src/backend/executor/execMain.c | 1 +
src/backend/executor/execParallel.c | 1 +
src/backend/executor/execPartition.c | 18 ++++-
src/backend/executor/execUtils.c | 1 +
src/backend/executor/nodeAppend.c | 5 +-
src/backend/executor/nodeMergeAppend.c | 5 +-
src/backend/optimizer/plan/createplan.c | 24 +++----
src/backend/optimizer/plan/planner.c | 1 +
src/backend/optimizer/plan/setrefs.c | 88 ++++++++++++++++---------
src/backend/partitioning/partprune.c | 19 ++++--
src/include/executor/execPartition.h | 4 +-
src/include/nodes/execnodes.h | 1 +
src/include/nodes/pathnodes.h | 6 ++
src/include/nodes/plannodes.h | 14 ++--
src/include/partitioning/partprune.h | 8 +--
15 files changed, 134 insertions(+), 62 deletions(-)
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 4c5a7bbf62..49293fc8ce 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -855,6 +855,7 @@ InitPlan(QueryDesc *queryDesc, int eflags)
ExecInitRangeTable(estate, rangeTable, plannedstmt->permInfos);
estate->es_plannedstmt = plannedstmt;
+ estate->es_part_prune_infos = plannedstmt->partPruneInfos;
/*
* Next, build the ExecRowMark array from the PlanRowMark(s), if any.
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index cc2b8ccab7..aa3f283453 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -183,6 +183,7 @@ ExecSerializePlan(Plan *plan, EState *estate)
pstmt->dependsOnRole = false;
pstmt->parallelModeNeeded = false;
pstmt->planTree = plan;
+ pstmt->partPruneInfos = estate->es_part_prune_infos;
pstmt->rtable = estate->es_range_table;
pstmt->permInfos = estate->es_rteperminfos;
pstmt->resultRelations = NIL;
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index eb8a87fd63..9799968a42 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1778,6 +1778,9 @@ adjust_partition_colnos_using_map(List *colnos, AttrMap *attrMap)
* Initialize data structure needed for run-time partition pruning and
* do initial pruning if needed
*
+ * 'root_parent_relids' identifies the relation to which both the parent plan
+ * and the PartitionPruneInfo given by 'part_prune_index' belong.
+ *
* On return, *initially_valid_subplans is assigned the set of indexes of
* child subplans that must be initialized along with the parent plan node.
* Initial pruning is performed here if needed and in that case only the
@@ -1790,11 +1793,24 @@ adjust_partition_colnos_using_map(List *colnos, AttrMap *attrMap)
PartitionPruneState *
ExecInitPartitionPruning(PlanState *planstate,
int n_total_subplans,
- PartitionPruneInfo *pruneinfo,
+ int part_prune_index,
+ Bitmapset *root_parent_relids,
Bitmapset **initially_valid_subplans)
{
PartitionPruneState *prunestate;
EState *estate = planstate->state;
+ PartitionPruneInfo *pruneinfo;
+
+ /* Obtain the pruneinfo we need, and make sure it's the right one */
+ pruneinfo = list_nth(estate->es_part_prune_infos, part_prune_index);
+ if (!bms_equal(root_parent_relids, pruneinfo->root_parent_relids))
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("mismatching PartitionPruneInfo found at part_prune_index %d",
+ part_prune_index),
+ errdetail_internal("plan node relids %s, pruneinfo relids %s",
+ bmsToString(root_parent_relids),
+ bmsToString(pruneinfo->root_parent_relids)));
/* We may need an expression context to evaluate partition exprs */
ExecAssignExprContext(estate, planstate);
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index c06b228858..48366a33b5 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -123,6 +123,7 @@ CreateExecutorState(void)
estate->es_rowmarks = NULL;
estate->es_rteperminfos = NIL;
estate->es_plannedstmt = NULL;
+ estate->es_part_prune_infos = NIL;
estate->es_junkFilter = NULL;
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 609df6b9e6..c185b11c67 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -134,7 +134,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->as_begun = false;
/* If run-time partition pruning is enabled, then set that up now */
- if (node->part_prune_info != NULL)
+ if (node->part_prune_index >= 0)
{
PartitionPruneState *prunestate;
@@ -145,7 +145,8 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
*/
prunestate = ExecInitPartitionPruning(&appendstate->ps,
list_length(node->appendplans),
- node->part_prune_info,
+ node->part_prune_index,
+ node->apprelids,
&validsubplans);
appendstate->as_prune_state = prunestate;
nplans = bms_num_members(validsubplans);
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index 21b5726e6e..399b39c598 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -82,7 +82,7 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
mergestate->ps.ExecProcNode = ExecMergeAppend;
/* If run-time partition pruning is enabled, then set that up now */
- if (node->part_prune_info != NULL)
+ if (node->part_prune_index >= 0)
{
PartitionPruneState *prunestate;
@@ -93,7 +93,8 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
*/
prunestate = ExecInitPartitionPruning(&mergestate->ps,
list_length(node->mergeplans),
- node->part_prune_info,
+ node->part_prune_index,
+ node->apprelids,
&validsubplans);
mergestate->ms_prune_state = prunestate;
nplans = bms_num_members(validsubplans);
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index af48109058..3ea5fef5d8 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1203,7 +1203,6 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
ListCell *subpaths;
int nasyncplans = 0;
RelOptInfo *rel = best_path->path.parent;
- PartitionPruneInfo *partpruneinfo = NULL;
int nodenumsortkeys = 0;
AttrNumber *nodeSortColIdx = NULL;
Oid *nodeSortOperators = NULL;
@@ -1354,6 +1353,9 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
subplans = lappend(subplans, subplan);
}
+ /* Set below if we find quals that we can use to run-time prune */
+ plan->part_prune_index = -1;
+
/*
* If any quals exist, they may be useful to perform further partition
* pruning during execution. Gather information needed by the executor to
@@ -1377,16 +1379,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
}
if (prunequal != NIL)
- partpruneinfo =
- make_partition_pruneinfo(root, rel,
- best_path->subpaths,
- prunequal);
+ plan->part_prune_index = make_partition_pruneinfo(root, rel,
+ best_path->subpaths,
+ prunequal);
}
plan->appendplans = subplans;
plan->nasyncplans = nasyncplans;
plan->first_partial_plan = best_path->first_partial_path;
- plan->part_prune_info = partpruneinfo;
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -1425,7 +1425,6 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
List *subplans = NIL;
ListCell *subpaths;
RelOptInfo *rel = best_path->path.parent;
- PartitionPruneInfo *partpruneinfo = NULL;
/*
* We don't have the actual creation of the MergeAppend node split out
@@ -1518,6 +1517,9 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
subplans = lappend(subplans, subplan);
}
+ /* Set below if we find quals that we can use to run-time prune */
+ node->part_prune_index = -1;
+
/*
* If any quals exist, they may be useful to perform further partition
* pruning during execution. Gather information needed by the executor to
@@ -1533,13 +1535,13 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
Assert(best_path->path.param_info == NULL);
if (prunequal != NIL)
- partpruneinfo = make_partition_pruneinfo(root, rel,
- best_path->subpaths,
- prunequal);
+ node->part_prune_index = make_partition_pruneinfo(root, rel,
+ best_path->subpaths,
+ prunequal);
}
node->mergeplans = subplans;
- node->part_prune_info = partpruneinfo;
+
/*
* If prepare_sort_from_pathkeys added sort columns, but we were told to
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 44efb1f4eb..f9e030cad1 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -544,6 +544,7 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
result->dependsOnRole = glob->dependsOnRole;
result->parallelModeNeeded = glob->parallelModeNeeded;
result->planTree = top_plan;
+ result->partPruneInfos = glob->partPruneInfos;
result->rtable = glob->finalrtable;
result->permInfos = glob->finalrteperminfos;
result->resultRelations = glob->resultRelations;
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 97fa561e4e..32abd1dfac 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -349,6 +349,29 @@ set_plan_references(PlannerInfo *root, Plan *plan)
palloc0(list_length(glob->subplans) * sizeof(bool));
}
+ /* Also fix up the information in PartitionPruneInfos. */
+ foreach(lc, root->partPruneInfos)
+ {
+ PartitionPruneInfo *pruneinfo = lfirst(lc);
+ ListCell *l;
+
+ pruneinfo->root_parent_relids =
+ offset_relid_set(pruneinfo->root_parent_relids, rtoffset);
+ foreach(l, pruneinfo->prune_infos)
+ {
+ List *prune_infos = lfirst(l);
+ ListCell *l2;
+
+ foreach(l2, prune_infos)
+ {
+ PartitionedRelPruneInfo *pinfo = lfirst(l2);
+
+ /* RT index of the table to which the pinfo belongs. */
+ pinfo->rtindex += rtoffset;
+ }
+ }
+ }
+
/* Now fix the Plan tree */
result = set_plan_refs(root, plan, rtoffset);
@@ -1715,6 +1738,29 @@ set_customscan_references(PlannerInfo *root,
cscan->custom_relids = offset_relid_set(cscan->custom_relids, rtoffset);
}
+/*
+ * register_partpruneinfo
+ * Subroutine for set_append_references and set_mergeappend_references
+ *
+ * Add the PartitionPruneInfo from root->partPruneInfos at the given index
+ * into PlannerGlobal->partPruneInfos and return its index there.
+ */
+static int
+register_partpruneinfo(PlannerInfo *root, int part_prune_index)
+{
+ PlannerGlobal *glob = root->glob;
+ PartitionPruneInfo *pruneinfo;
+
+ Assert(part_prune_index >= 0 &&
+ part_prune_index < list_length(root->partPruneInfos));
+ pruneinfo = list_nth_node(PartitionPruneInfo, root->partPruneInfos,
+ part_prune_index);
+
+ glob->partPruneInfos = lappend(glob->partPruneInfos, pruneinfo);
+
+ return list_length(glob->partPruneInfos) - 1;
+}
+
/*
* set_append_references
* Do set_plan_references processing on an Append
@@ -1767,21 +1813,12 @@ set_append_references(PlannerInfo *root,
aplan->apprelids = offset_relid_set(aplan->apprelids, rtoffset);
- if (aplan->part_prune_info)
- {
- foreach(l, aplan->part_prune_info->prune_infos)
- {
- List *prune_infos = lfirst(l);
- ListCell *l2;
-
- foreach(l2, prune_infos)
- {
- PartitionedRelPruneInfo *pinfo = lfirst(l2);
-
- pinfo->rtindex += rtoffset;
- }
- }
- }
+ /*
+ * Add PartitionPruneInfo, if any, to PlannerGlobal and update the index.
+ */
+ if (aplan->part_prune_index >= 0)
+ aplan->part_prune_index =
+ register_partpruneinfo(root, aplan->part_prune_index);
/* We don't need to recurse to lefttree or righttree ... */
Assert(aplan->plan.lefttree == NULL);
@@ -1843,21 +1880,12 @@ set_mergeappend_references(PlannerInfo *root,
mplan->apprelids = offset_relid_set(mplan->apprelids, rtoffset);
- if (mplan->part_prune_info)
- {
- foreach(l, mplan->part_prune_info->prune_infos)
- {
- List *prune_infos = lfirst(l);
- ListCell *l2;
-
- foreach(l2, prune_infos)
- {
- PartitionedRelPruneInfo *pinfo = lfirst(l2);
-
- pinfo->rtindex += rtoffset;
- }
- }
- }
+ /*
+ * Add PartitionPruneInfo, if any, to PlannerGlobal and update the index.
+ */
+ if (mplan->part_prune_index >= 0)
+ mplan->part_prune_index =
+ register_partpruneinfo(root, mplan->part_prune_index);
/* We don't need to recurse to lefttree or righttree ... */
Assert(mplan->plan.lefttree == NULL);
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index 7179b22a05..0fb1035127 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -210,16 +210,20 @@ static void partkey_datum_from_expr(PartitionPruneContext *context,
/*
* make_partition_pruneinfo
- * Builds a PartitionPruneInfo which can be used in the executor to allow
- * additional partition pruning to take place. Returns NULL when
- * partition pruning would be useless.
+ * Checks if the given set of quals can be used to build pruning steps
+ * that the executor can use to prune away unneeded partitions. If
+ * suitable quals are found then a PartitionPruneInfo is built and tagged
+ * onto the PlannerInfo's partPruneInfos list.
+ *
+ * The return value is the 0-based index of the item added to the
+ * partPruneInfos list or -1 if nothing was added.
*
* 'parentrel' is the RelOptInfo for an appendrel, and 'subpaths' is the list
* of scan paths for its child rels.
* 'prunequal' is a list of potential pruning quals (i.e., restriction
* clauses that are applicable to the appendrel).
*/
-PartitionPruneInfo *
+int
make_partition_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
List *subpaths,
List *prunequal)
@@ -333,10 +337,11 @@ make_partition_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
* quals, then we can just not bother with run-time pruning.
*/
if (prunerelinfos == NIL)
- return NULL;
+ return -1;
/* Else build the result data structure */
pruneinfo = makeNode(PartitionPruneInfo);
+ pruneinfo->root_parent_relids = parentrel->relids;
pruneinfo->prune_infos = prunerelinfos;
/*
@@ -359,7 +364,9 @@ make_partition_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
else
pruneinfo->other_subplans = NULL;
- return pruneinfo;
+ root->partPruneInfos = lappend(root->partPruneInfos, pruneinfo);
+
+ return list_length(root->partPruneInfos) - 1;
}
/*
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 15ec869ac8..ee487e42dd 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -123,9 +123,9 @@ typedef struct PartitionPruneState
extern PartitionPruneState *ExecInitPartitionPruning(PlanState *planstate,
int n_total_subplans,
- PartitionPruneInfo *pruneinfo,
+ int part_prune_index,
+ Bitmapset *root_parent_relids,
Bitmapset **initially_valid_subplans);
extern Bitmapset *ExecFindMatchingSubPlans(PartitionPruneState *prunestate,
bool initial_prune);
-
#endif /* EXECPARTITION_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index cb714f4a19..21c388a1f4 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -623,6 +623,7 @@ typedef struct EState
* ExecRowMarks, or NULL if none */
List *es_rteperminfos; /* List of RTEPermissionInfo */
PlannedStmt *es_plannedstmt; /* link to top of plan tree */
+ List *es_part_prune_infos; /* PlannedStmt.partPruneInfos */
const char *es_sourceText; /* Source text from QueryDesc */
JunkFilter *es_junkFilter; /* top-level junk filter, if any */
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index a1dc1d07e1..e610c3183e 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -125,6 +125,9 @@ typedef struct PlannerGlobal
/* "flat" list of AppendRelInfos */
List *appendRelations;
+ /* List of PartitionPruneInfo contained in the plan */
+ List *partPruneInfos;
+
/* OIDs of relations the plan depends on */
List *relationOids;
@@ -544,6 +547,9 @@ struct PlannerInfo
/* Does this query modify any partition key columns? */
bool partColsUpdated;
+
+ /* PartitionPruneInfos added in this query's plan. */
+ List *partPruneInfos;
};
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 1b787fe031..c91ffb3eac 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -70,6 +70,9 @@ typedef struct PlannedStmt
struct Plan *planTree; /* tree of Plan nodes */
+ List *partPruneInfos; /* List of PartitionPruneInfo contained in
+ * the plan */
+
List *rtable; /* list of RangeTblEntry nodes */
List *permInfos; /* list of RTEPermissionInfo nodes for rtable
@@ -273,8 +276,8 @@ typedef struct Append
*/
int first_partial_plan;
- /* Info for run-time subplan pruning; NULL if we're not doing that */
- struct PartitionPruneInfo *part_prune_info;
+ /* Index to PlannerInfo.partPruneInfos or -1 if no run-time pruning */
+ int part_prune_index;
} Append;
/* ----------------
@@ -308,8 +311,8 @@ typedef struct MergeAppend
/* NULLS FIRST/LAST directions */
bool *nullsFirst pg_node_attr(array_size(numCols));
- /* Info for run-time subplan pruning; NULL if we're not doing that */
- struct PartitionPruneInfo *part_prune_info;
+ /* Index to PlannerInfo.partPruneInfos or -1 if no run-time pruning */
+ int part_prune_index;
} MergeAppend;
/* ----------------
@@ -1411,6 +1414,8 @@ typedef struct PlanRowMark
* Then, since an Append-type node could have multiple partitioning
* hierarchies among its children, we have an unordered List of those Lists.
*
+ * root_parent_relids RelOptInfo.relids of the relation to which the parent
+ * plan node and this PartitionPruneInfo node belong
* prune_infos List of Lists containing PartitionedRelPruneInfo nodes,
* one sublist per run-time-prunable partition hierarchy
* appearing in the parent plan node's subplans.
@@ -1423,6 +1428,7 @@ typedef struct PartitionPruneInfo
pg_node_attr(no_equal, no_query_jumble)
NodeTag type;
+ Bitmapset *root_parent_relids;
List *prune_infos;
Bitmapset *other_subplans;
} PartitionPruneInfo;
diff --git a/src/include/partitioning/partprune.h b/src/include/partitioning/partprune.h
index 8636e04e37..c0d6889d47 100644
--- a/src/include/partitioning/partprune.h
+++ b/src/include/partitioning/partprune.h
@@ -70,10 +70,10 @@ typedef struct PartitionPruneContext
#define PruneCxtStateIdx(partnatts, step_id, keyno) \
((partnatts) * (step_id) + (keyno))
-extern PartitionPruneInfo *make_partition_pruneinfo(struct PlannerInfo *root,
- struct RelOptInfo *parentrel,
- List *subpaths,
- List *prunequal);
+extern int make_partition_pruneinfo(struct PlannerInfo *root,
+ struct RelOptInfo *parentrel,
+ List *subpaths,
+ List *prunequal);
extern Bitmapset *prune_append_rel_partitions(struct RelOptInfo *rel);
extern Bitmapset *get_matching_partitions(PartitionPruneContext *context,
List *pruning_steps);
--
2.35.3
v1-0002-Share-initial-pruning-result-between-parallel-que.patch
From 069079bc0814d85b55d58f96532546cafc79f136 Mon Sep 17 00:00:00 2001
From: Amit Langote <amitlan@postgresql.org>
Date: Wed, 9 Aug 2023 18:25:29 +0900
Subject: [PATCH v1 2/2] Share initial pruning result between parallel query
processes
In the leader, ExecInitPartitionPruning() will save the resulting
bitmapset of initially valid subplans found by performing initial
pruning steps in EState.es_part_prune_results, a List of Bitmapsets
containing an element for each one in EState.es_part_prune_infos.
In workers, it will read a bitmapset at the given part_prune_index
and use that as the set of initially valid subplans, instead of
performing the initial pruning steps again.
Note that this is not just a performance optimization, but also
important to avoid different processes involved in the execution of a
Parallel Append with initial pruning possibly ending up with
different sets of initially valid subplans. That can happen despite
the requirement that initial pruning steps may only contain
expressions that are at least stable, especially if the volatility
markings are not accurate.
Discussion: https://postgr.es/m/flat/CA%2BHiwqFA%3DswkzgGK8AmXUNFtLeEXFJwFyY3E7cTxvL46aa1OTw%40mail.gmail.com
---
src/backend/executor/execMain.c | 1 +
src/backend/executor/execParallel.c | 37 +++++++++++++++++++++---
src/backend/executor/execPartition.c | 42 +++++++++++++++++++++++++++-
src/backend/executor/execUtils.c | 1 +
src/backend/tcop/pquery.c | 1 +
src/include/executor/execdesc.h | 6 ++++
src/include/nodes/execnodes.h | 1 +
7 files changed, 84 insertions(+), 5 deletions(-)
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 49293fc8ce..ba954d7934 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -856,6 +856,7 @@ InitPlan(QueryDesc *queryDesc, int eflags)
estate->es_plannedstmt = plannedstmt;
estate->es_part_prune_infos = plannedstmt->partPruneInfos;
+ estate->es_part_prune_results = queryDesc->part_prune_results;
/*
* Next, build the ExecRowMark array from the PlanRowMark(s), if any.
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index aa3f283453..1a6b2611de 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -66,6 +66,7 @@
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xE000000000000008)
#define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xE00000000000000A)
+#define PARALLEL_KEY_PARTITION_PRUNE_RESULTS UINT64CONST(0xE00000000000000B)
#define PARALLEL_TUPLE_QUEUE_SIZE 65536
@@ -598,12 +599,15 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
FixedParallelExecutorState *fpes;
char *pstmt_data;
char *pstmt_space;
+ char *part_prune_results_data;
+ char *part_prune_results_space;
char *paramlistinfo_space;
BufferUsage *bufusage_space;
WalUsage *walusage_space;
SharedExecutorInstrumentation *instrumentation = NULL;
SharedJitInstrumentation *jit_instrumentation = NULL;
int pstmt_len;
+ int part_prune_results_len;
int paramlistinfo_len;
int instrumentation_len = 0;
int jit_instrumentation_len = 0;
@@ -632,6 +636,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
/* Fix up and serialize plan to be sent to workers. */
pstmt_data = ExecSerializePlan(planstate->plan, estate);
+ part_prune_results_data = nodeToString(estate->es_part_prune_results);
/* Create a parallel context. */
pcxt = CreateParallelContext("postgres", "ParallelQueryMain", nworkers);
@@ -658,6 +663,11 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_estimate_chunk(&pcxt->estimator, pstmt_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
+ /* Estimate space for serialized part_prune_results. */
+ part_prune_results_len = strlen(part_prune_results_data) + 1;
+ shm_toc_estimate_chunk(&pcxt->estimator, part_prune_results_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
/* Estimate space for serialized ParamListInfo. */
paramlistinfo_len = EstimateParamListSpace(estate->es_param_list_info);
shm_toc_estimate_chunk(&pcxt->estimator, paramlistinfo_len);
@@ -752,6 +762,12 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
memcpy(pstmt_space, pstmt_data, pstmt_len);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_PLANNEDSTMT, pstmt_space);
+ /* Store serialized part_prune_results. */
+ part_prune_results_space = shm_toc_allocate(pcxt->toc, part_prune_results_len);
+ memcpy(part_prune_results_space, part_prune_results_data, part_prune_results_len);
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_PARTITION_PRUNE_RESULTS,
+ part_prune_results_space);
+
/* Store serialized ParamListInfo. */
paramlistinfo_space = shm_toc_allocate(pcxt->toc, paramlistinfo_len);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_PARAMLISTINFO, paramlistinfo_space);
@@ -1233,8 +1249,11 @@ ExecParallelGetQueryDesc(shm_toc *toc, DestReceiver *receiver,
int instrument_options)
{
char *pstmtspace;
+ char *part_prune_results_space;
char *paramspace;
PlannedStmt *pstmt;
+ QueryDesc *queryDesc;
+ List *part_prune_results;
ParamListInfo paramLI;
char *queryString;
@@ -1245,15 +1264,25 @@ ExecParallelGetQueryDesc(shm_toc *toc, DestReceiver *receiver,
pstmtspace = shm_toc_lookup(toc, PARALLEL_KEY_PLANNEDSTMT, false);
pstmt = (PlannedStmt *) stringToNode(pstmtspace);
+ /* Reconstruct leader-supplied part_prune_results. */
+ part_prune_results_space =
+ shm_toc_lookup(toc, PARALLEL_KEY_PARTITION_PRUNE_RESULTS, false);
+ part_prune_results = (List *) stringToNode(part_prune_results_space);
+
/* Reconstruct ParamListInfo. */
paramspace = shm_toc_lookup(toc, PARALLEL_KEY_PARAMLISTINFO, false);
paramLI = RestoreParamList(&paramspace);
/* Create a QueryDesc for the query. */
- return CreateQueryDesc(pstmt,
- queryString,
- GetActiveSnapshot(), InvalidSnapshot,
- receiver, paramLI, NULL, instrument_options);
+ queryDesc = CreateQueryDesc(pstmt,
+ queryString,
+ GetActiveSnapshot(), InvalidSnapshot,
+ receiver, paramLI, NULL, instrument_options);
+
+ /* For ExecutorStart() to propagate into the EState. */
+ queryDesc->part_prune_results = part_prune_results;
+
+ return queryDesc;
}
/*
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 9799968a42..a1ec26ad3c 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1822,13 +1822,53 @@ ExecInitPartitionPruning(PlanState *planstate,
* Perform an initial partition prune pass, if required.
*/
if (prunestate->do_initial_prune)
- *initially_valid_subplans = ExecFindMatchingSubPlans(prunestate, true);
+ {
+ /*
+ * If doing this in a parallel query plan, compute the initial pruning
+ * steps only once in the leader and have it pass that down to the
+ * workers so they don't need to compute that again. That is done not
+ * just for performance, but also to avoid situations where a worker
+ * might end up with a different result of performing the same initial
+ * pruning steps than the leader and/or other workers.
+ */
+ if (IsParallelWorker())
+ {
+ Assert(estate->es_part_prune_results != NULL);
+ *initially_valid_subplans = list_nth(estate->es_part_prune_results,
+ part_prune_index);
+ }
+ else
+ {
+ *initially_valid_subplans = ExecFindMatchingSubPlans(prunestate,
+ true);
+
+ /*
+ * Add the bitmap at part_prune_index to pass to parallel workers.
+ * XXX - no way to tell whether or not we're under Gather to avoid
+ * populating the list if not.
+ */
+ Assert(list_length(estate->es_part_prune_results) ==
+ part_prune_index);
+ estate->es_part_prune_results =
+ lappend(estate->es_part_prune_results,
+ *initially_valid_subplans);
+ }
+ }
else
{
/* No pruning, so we'll need to initialize all subplans */
Assert(n_total_subplans > 0);
*initially_valid_subplans = bms_add_range(NULL, 0,
n_total_subplans - 1);
+
+ /*
+ * Add a dummy NULL to es_part_prune_results at this index to keep it
+ * of the same length as es_part_prune_infos. Note that the worker
+ * won't actually read this element, so there's no confusing NULL for
+ * an empty set of initially valid subplans.
+ */
+ estate->es_part_prune_results =
+ lappend(estate->es_part_prune_results, NULL);
}
/*
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 48366a33b5..5c6eb32e2e 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -141,6 +141,7 @@ CreateExecutorState(void)
estate->es_param_exec_vals = NULL;
estate->es_queryEnv = NULL;
+ estate->es_part_prune_results = NIL;
estate->es_query_cxt = qcontext;
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 5565f200c3..ce94acdc8a 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -91,6 +91,7 @@ CreateQueryDesc(PlannedStmt *plannedstmt,
qd->estate = NULL;
qd->planstate = NULL;
qd->totaltime = NULL;
+ qd->part_prune_results = NIL;
/* not yet executed */
qd->already_executed = false;
diff --git a/src/include/executor/execdesc.h b/src/include/executor/execdesc.h
index af2bf36dfb..d96c383dbf 100644
--- a/src/include/executor/execdesc.h
+++ b/src/include/executor/execdesc.h
@@ -43,6 +43,12 @@ typedef struct QueryDesc
QueryEnvironment *queryEnv; /* query environment passed in */
int instrument_options; /* OR of InstrumentOption flags */
+ /*
+ * Used by ExecParallelGetQueryDesc() to save the result of initial
+ * partition pruning sent down by the leader to workers.
+ */
+ List *part_prune_results; /* List of Bitmapset */
+
/* These fields are set by ExecutorStart */
TupleDesc tupDesc; /* descriptor for result tuples */
EState *estate; /* executor's query-wide state */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 21c388a1f4..d92d7572eb 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -624,6 +624,7 @@ typedef struct EState
List *es_rteperminfos; /* List of RTEPermissionInfo */
PlannedStmt *es_plannedstmt; /* link to top of plan tree */
List *es_part_prune_infos; /* PlannedStmt.partPruneInfos */
+ List *es_part_prune_results; /* List of Bitmapset */
const char *es_sourceText; /* Source text from QueryDesc */
JunkFilter *es_junkFilter; /* top-level junk filter, if any */
--
2.35.3
On Wed, Aug 9, 2023 at 6:22 AM Amit Langote <amitlangote09@gmail.com> wrote:
>>> I'm assuming it's not
>>> too ugly if ExecInitAppend() uses IsParallelWorker() to decide whether
>>> it should be writing to EState.es_part_prune_results or reading from
>>> it -- the former if in the leader and the latter in a worker.
>>
>> I don't think that's too ugly. I mean you have to have an if statement
>> someplace.
>
> Yes, that makes sense.
>
> It's just that I thought maybe I haven't thought hard enough about
> options before adding a new IsParallelWorker(), because I don't find
> too many instances of IsParallelWorker() in the generic executor code.
> I think that's because most parallel worker-specific logic lives in
> execParallel.c or in Exec*Worker() functions outside that file, so the
> generic code remains parallel query agnostic as much as possible.
Oh, actually, there is an issue here. IsParallelWorker() is not the
right test. Imagine that there's a parallel query which launches some
workers, and one of those calls a user-defined function which again
uses parallelism, launching more workers. This may not be possible
today, I don't really remember, but the test should be "am I a
parallel worker with respect to this plan?" not "am I a parallel
worker at all?".

Not quite sure what the best way to code that is. If
we could just test whether we have a ParallelWorkerContext, it would
be easy...
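One hypothetical way to scope the test to the plan at hand
(EXEC_FLAG_PARALLEL_WORKER is invented here, not an existing flag): let
ParallelQueryMain() pass worker-ness down via the eflags, which land in
EState.es_top_eflags at ExecutorStart() time:

/* in ParallelQueryMain(), sketch only */
ExecutorStart(queryDesc, fpes->eflags | EXEC_FLAG_PARALLEL_WORKER);

/* ...and in ExecInitPartitionPruning(), instead of IsParallelWorker() */
if (estate->es_top_eflags & EXEC_FLAG_PARALLEL_WORKER)
    *initially_valid_subplans = list_nth(estate->es_part_prune_results,
                                         part_prune_index);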
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Aug 9, 2023 at 9:48 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Aug 9, 2023 at 6:22 AM Amit Langote <amitlangote09@gmail.com> wrote:
>>>> I'm assuming it's not
>>>> too ugly if ExecInitAppend() uses IsParallelWorker() to decide whether
>>>> it should be writing to EState.es_part_prune_results or reading from
>>>> it -- the former if in the leader and the latter in a worker.
>>>
>>> I don't think that's too ugly. I mean you have to have an if statement
>>> someplace.
>>
>> Yes, that makes sense.
>>
>> It's just that I thought maybe I haven't thought hard enough about
>> options before adding a new IsParallelWorker(), because I don't find
>> too many instances of IsParallelWorker() in the generic executor code.
>> I think that's because most parallel worker-specific logic lives in
>> execParallel.c or in Exec*Worker() functions outside that file, so the
>> generic code remains parallel query agnostic as much as possible.
>
> Oh, actually, there is an issue here. IsParallelWorker() is not the
> right test. Imagine that there's a parallel query which launches some
> workers, and one of those calls a user-defined function which again
> uses parallelism, launching more workers. This may not be possible
> today, I don't really remember, but the test should be "am I a
> parallel worker with respect to this plan?" not "am I a parallel
> worker at all?".
>
> Not quite sure what the best way to code that is. If
> we could just test whether we have a ParallelWorkerContext, it would
> be easy...
I checked enough to be sure that IsParallelWorker() is reliable at the
time of ExecutorStart() / ExecInitNode() in ParallelQueryMain() in a
worker. However, ParallelWorkerContext is not available at that
point. Here's the relevant part of ParallelQueryMain():
/* Start up the executor */
queryDesc->plannedstmt->jitFlags = fpes->jit_flags;
ExecutorStart(queryDesc, fpes->eflags);

/* Special executor initialization steps for parallel workers */
queryDesc->planstate->state->es_query_dsa = area;
if (DsaPointerIsValid(fpes->param_exec))
{
    char       *paramexec_space;

    paramexec_space = dsa_get_address(area, fpes->param_exec);
    RestoreParamExecParams(paramexec_space, queryDesc->estate);
}
pwcxt.toc = toc;
pwcxt.seg = seg;
ExecParallelInitializeWorker(queryDesc->planstate, &pwcxt);
BTW, we do also use IsParallelWorker() in ExecGetRangeTableRelation()
which also probably only runs during ExecInitNode(), same as
ExecInitPartitionPruning() that this patch adds it to.
--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com
On Wed, Aug 9, 2023 at 8:57 AM Amit Langote <amitlangote09@gmail.com> wrote:
> I checked enough to be sure that IsParallelWorker() is reliable at the
> time of ExecutorStart() / ExecInitNode() in ParallelQueryMain() in a
> worker. However, ParallelWorkerContext is not available at that
> point. Here's the relevant part of ParallelQueryMain():
>
>     /* Start up the executor */
>     queryDesc->plannedstmt->jitFlags = fpes->jit_flags;
>     ExecutorStart(queryDesc, fpes->eflags);
>
>     /* Special executor initialization steps for parallel workers */
>     queryDesc->planstate->state->es_query_dsa = area;
>     if (DsaPointerIsValid(fpes->param_exec))
>     {
>         char       *paramexec_space;
>
>         paramexec_space = dsa_get_address(area, fpes->param_exec);
>         RestoreParamExecParams(paramexec_space, queryDesc->estate);
>     }
>     pwcxt.toc = toc;
>     pwcxt.seg = seg;
>     ExecParallelInitializeWorker(queryDesc->planstate, &pwcxt);
>
> BTW, we do also use IsParallelWorker() in ExecGetRangeTableRelation()
> which also probably only runs during ExecInitNode(), same as
> ExecInitPartitionPruning() that this patch adds it to.
I don't know if that's a great idea, but I guess if we're already
doing it, it doesn't hurt to expand the use a little bit.
--
Robert Haas
EDB: http://www.enterprisedb.com